
Chapter 3: Investigating System Components and Solving Common Problems


This chapter guides you through the steps to investigate system components, identify problems that you may encounter as you use HACMP, and offers possible solutions.

Overview

If no error messages are displayed on the console and if examining the log files proves fruitless, you next investigate each component of your HACMP environment and eliminate it as the cause of the problem. The first section of this chapter reviews methods for investigating system components, including the RSCT subsystem. It includes these sections:

  • Investigating System Components
  • Checking Highly Available Applications
  • Checking the HACMP Layer
  • Checking the Logical Volume Manager
  • Checking the TCP/IP Subsystem
  • Checking the AIX 5L Operating System
  • Checking Physical Networks
  • Checking Disks, Disk Adapters, and Disk Heartbeating Networks
  • Checking the Cluster Communications Daemon
  • Checking System Hardware.
The second section provides recommendations for investigating the following areas:

  • HACMP Installation Issues
  • HACMP Startup Issues
  • Disk and Filesystem Issues
  • Network and Switch Issues
  • Cluster Communications Issues
  • HACMP Takeover Issues
  • Client Issues
  • Miscellaneous Issues.
    Investigating System Components

    Both HACMP and AIX 5L provide utilities you can use to determine the state of an HACMP cluster and the resources within that cluster. Using these commands, you can gather information about volume groups or networks. Your knowledge of the HACMP system is essential. You must know the characteristics of a normal cluster beforehand and be on the lookout for deviations from the norm as you examine the cluster components. Often, the surviving cluster nodes can provide an example of the correct setting for a system parameter or for other cluster configuration information.

    The following sections review the HACMP cluster components that you can check and describes some useful utilities. If examining the cluster log files does not reveal the source of a problem, investigate each system component using a top-down strategy to move through the layers. You should investigate the components in the following order:

      1. Application layer
      2. HACMP layer
      3. Logical Volume Manager layer
      4. TCP/IP layer
      5. AIX 5L layer
      6. Physical network layer
      7. Physical disk layer
      8. System hardware layer.

    The following sections describe what you should look for when examining each layer. They also briefly describe the tools you should use to examine the layers.

    Checking Highly Available Applications

    As a first step to finding problems affecting a cluster, check each highly available application running on the cluster. Examine any application-specific log files and perform any troubleshooting procedures recommended in the application’s documentation. In addition, check the following:

  • Do some simple tests; for example, for a database application try to add and delete a record.
  • Use the ps command to check that the necessary processes are running, or to verify that the processes were stopped properly; a sample check follows this list.
  • Check that the resources the application expects to be present are available; for example, the filesystems and volume groups.
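
    For example, assuming a hypothetical database application whose server process is named dbserver and whose data lives in /dbfs on volume group dbvg (all placeholder names), a quick check might look like:

    ps -ef | grep dbserver | grep -v grep    # is the server process running?
    mount | grep /dbfs                       # is its shared filesystem mounted?
    lsvg -o | grep dbvg                      # is its shared volume group varied on?
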
    Checking the HACMP Layer

    If checking the application layer does not reveal the source of a problem, check the HACMP layer. The two main areas to investigate are:

  • HACMP components and required files
  • Cluster topology and configuration.
    The following sections describe how to investigate these areas.

    Note: These steps assume that you have checked the log files and that they do not point to the problem.

    Checking HACMP Components

    An HACMP cluster is made up of several required files and daemons. The following sections describe what to check for in the HACMP layer.

    Checking HACMP Required Files

    Make sure that the HACMP files required for your cluster are in the proper place, have the proper permissions (readable and executable), and are not zero length. The HACMP files and the AIX 5L files modified by the HACMP software are listed in the README file that accompanies the product.
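
    As a spot check, you can confirm that a few of the HACMP utilities referenced in this chapter are present, executable, and not zero length (the authoritative list of files is in the product README):

    ls -l /usr/es/sbin/cluster/utilities/cltopinfo /usr/es/sbin/cluster/utilities/clRGinfo
    find /usr/es/sbin/cluster -type f -size 0 -exec ls -l {} \;    # flag any zero-length files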

    Checking Cluster Services and Processes

    Check the status of the following HACMP daemons:

  • The Cluster Manager (clstrmgrES) daemon
  • The Cluster Communications (clcomdES) daemon
  • The Cluster Information Program (clinfoES) daemon.
    When these components are not responding normally, determine if the daemons are active on a cluster node. Use either the options on the SMIT System Management (C-SPOC) > Manage HACMP Services > Show Cluster Services panel or the lssrc command.

    For example, to check on the status of all daemons under the control of the SRC, enter:

    lssrc -a | grep active 
    syslogd          ras              290990       active 
     sendmail         mail             270484       active 
     portmap          portmap          286868       active 
     inetd            tcpip            295106       active 
     snmpd            tcpip            303260       active 
     dpid2            tcpip            299162       active 
     hostmibd         tcpip            282812       active 
     aixmibd          tcpip            278670       active 
     biod             nfs              192646       active 
     rpc.statd        nfs              254122       active 
     rpc.lockd        nfs              274584       active 
     qdaemon          spooler          196720       active 
     writesrv         spooler          250020       active 
     ctrmc            rsct             98392        active 
     clcomdES         clcomdES         204920       active 
     IBM.CSMAgentRM   rsct_rm          90268        active 
     IBM.ServiceRM    rsct_rm          229510       active 
     IBM.ERRM         rsct_rm          188602       active 
     IBM.AuditRM      rsct_rm          151722       active 
     topsvcs          topsvcs          602292       active 
     grpsvcs          grpsvcs          569376       active 
     emsvcs           emsvcs           561188       active 
     emaixos          emsvcs           557102       active 
     clstrmgrES       cluster          544802       active 
     gsclvmd                           565356       active 
     IBM.HostRM       rsct_rm          442380       active 
    

    To check on the status of all cluster daemons under the control of the SRC, enter:

    lssrc -g cluster 
    
    Note: When you use the -g flag with the lssrc command, the status information does not include the status of subsystems if they are inactive. If you need this information, use the -a flag instead. For more information on the lssrc command, see the man page.

    To view additional information on the status of a daemon, run the clcheck_server command. The clcheck_server command makes additional checks and retries beyond what is done by the lssrc command. For more information, see the clcheck_server man page.
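
    For example, to recheck the Cluster Manager subsystem, the invocation might look like the following sketch; passing the subsystem name as the argument is an assumption, so confirm the exact syntax and return codes in the man page:

    clcheck_server clstrmgrES    # assumption: subsystem name is passed as the argument
    echo $?                      # interpret the return code as described in the man page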

    To determine whether the Cluster Manager is running, or if processes started by the Cluster Manager are currently running on a node, use the ps command.

    For example, to determine whether the clstrmgrES daemon is running, enter:

    ps -ef | grep clstrmgrES 
    root  18363  3346  3 11:02:05    -  10:20 
    /usr/es/sbin/cluster/clstrmgrES 
    root  19028  19559 2 16:20:04 pts/10  0:00 grep clstrmgrES  
    

    See the ps man page for more information on using this command.

    Checking for Cluster Configuration Problems

    For an HACMP cluster to function properly, all the nodes in the cluster must agree on the cluster topology, network configuration, and ownership and takeover of HACMP resources. This information is stored in the Configuration Database on each cluster node.

    To begin checking for configuration problems, ask yourself if you (or others) have made any recent changes that may have disrupted the system. Have components been added or deleted? Has new software been loaded on the machine? Have new PTFs or application updates been performed? Has a system backup been restored? Then run verification to ensure that the proper HACMP-specific modifications to AIX 5L software are in place and that the cluster configuration is valid.

    The cluster verification utility checks many aspects of a cluster configuration and reports any inconsistencies. Using this utility, you can perform the following tasks:

  • Verify that all cluster nodes contain the same cluster topology information
  • Check that all network interface cards and tty lines are properly configured, and that shared disks are accessible to all nodes that can own them
  • Check each cluster node to determine whether multiple RS232 non-IP networks exist on the same tty device
  • Check for agreement among all nodes on the ownership of defined resources, such as filesystems, log files, volume groups, disks, and application servers
  • Check for invalid characters in cluster names, node names, network names, network interface names and resource group names
  • Verify takeover information.
    The verification utility will also print out diagnostic information about the following:

  • Custom snapshot methods
  • Custom verification methods
  • Custom pre or post events
  • Cluster log file redirection.
    If you have configured Kerberos on your system, the verification utility also determines that:

  • All IP labels listed in the configuration have the appropriate service principals in the .klogin file on each node in the cluster
  • All nodes have the proper service principals
  • Kerberos is installed on all nodes in the cluster
  • All nodes have the same security mode setting.
    From the main HACMP SMIT panel, select Problem Determination Tools > HACMP Verification > Verify HACMP Configuration. If you find a configuration problem, correct it, then resynchronize the cluster.

    Note: Some errors require that you make changes on each cluster node. For example, a missing application start script or a volume group with autovaryon=TRUE requires a correction on each affected node. Some of these issues can be taken care of by using HACMP File Collections.

    For more information about using the cluster verification utility and HACMP File Collections, see Chapter 7: Verifying and Synchronizing a Cluster Configuration in the Administration Guide.

    Run the /usr/es/sbin/cluster/utilities/cltopinfo command to see a complete listing of cluster topology. In addition to running the HACMP verification process, check for recent modifications to the node configuration files.

    The command ls -lt /etc will list all the files in the /etc directory and show the most recently modified files that are important to configuring AIX 5L, such as:

  • /etc/inetd.conf
  • /etc/hosts
  • /etc/services.
    It is also very important to check the resource group configuration for any errors that may not be flagged by the verification process. For example, make sure the filesystems required by the application servers are included in the resource group with the application.

    Check that the nodes in each resource group are the ones intended, and that the nodes are listed in the proper order. To view the cluster resource configuration information from the main HACMP SMIT panel, select Extended Configuration > Extended Resource Configuration > HACMP Extended Resource Group Configuration > Show All Resources by Node or Resource Group.

    You can also run the /usr/es/sbin/cluster/utilities/clRGinfo command to see the resource group information.

    Note: If cluster configuration problems arise after running the cluster verification utility, do not run C-SPOC commands in this environment as they may fail to execute on cluster nodes.

    Checking a Cluster Snapshot File

    The HACMP cluster snapshot facility (/usr/es/sbin/cluster/utilities/clsnapshots) allows you to save to a file a record of all the data that defines a particular cluster configuration. It also allows you to create your own custom snapshot methods to save additional information important to your configuration. You can use this snapshot for troubleshooting cluster problems. The default directory path for storage and retrieval of a snapshot is /usr/es/sbin/cluster/snapshots.

    Note that you cannot use the cluster snapshot facility in a cluster that is running different versions of HACMP concurrently.

    For information on how to create and apply cluster snapshots, see Chapter 18: Saving and Restoring Cluster Configurations in the Administration Guide.

    Information Saved in a Cluster Snapshot

    The primary information saved in a cluster snapshot is the data stored in the HACMP Configuration Database classes (such as HACMPcluster, HACMPnode, and HACMPnetwork). This is the information used to recreate the cluster configuration when a cluster snapshot is applied.

    The cluster snapshot does not save any user-customized scripts, applications, or other non-HACMP configuration parameters. For example, the name of an application server and the location of its start and stop scripts are stored in the HACMPserver Configuration Database object class. However, the scripts themselves as well as any applications they may call are not saved.

    The cluster snapshot does not save any device data or configuration-specific data that is outside the scope of HACMP. For instance, the facility saves the names of shared filesystems and volume groups; however, other details, such as NFS options or LVM mirroring configuration are not saved.

    If you moved resource groups using the Resource Group Management utility clRGmove, once you apply a snapshot, the resource groups return to behaviors specified by their default nodelists. To investigate a cluster after a snapshot has been applied, run clRGinfo to view the locations and states of resource groups.

    In addition to this Configuration Database data, a cluster snapshot also includes output generated by various HACMP and standard AIX 5L commands and utilities. This data includes the current state of the cluster, node, network, and network interfaces as viewed by each cluster node, as well as the state of any running HACMP daemons.

    The cluster snapshot includes output from the following commands:

    cllscf
    df
    lsfs
    netstat
    cllsnw
    exportfs
    lslpp
    no
    cllsif
    ifconfig
    lslv
    clchsyncd
    clshowres
    ls
    lsvg
    cltopinfo

    In HACMP 5.1 and up, by default, HACMP no longer collects cluster log files when you create the cluster snapshot, although you can still specify to do so in SMIT. Skipping the logs collection reduces the size of the snapshot and speeds up running the snapshot utility.

    You can use SMIT to collect cluster log files for problem reporting. This option is available under the Problem Determination Tools > HACMP Log Viewing and Management > Collect Cluster log files for Problem Reporting SMIT menu. It is recommended that you use this option only if requested by IBM support personnel.

    If you want to add commands to obtain site-specific information, create custom snapshot methods as described in the chapter on Saving and Restoring Cluster Configurations in the Administration Guide.

    Note that you can also use the AIX 5L snap -e command to collect HACMP cluster data, including the hacmp.out and clstrmgr.debug log files.

    Cluster Snapshot Files

    The cluster snapshot facility stores the data it saves in two separate files, the Configuration Database data file and the Cluster State Information File, each displaying information in three sections.

    Configuration Database Data File (.odm)

    This file contains all the data stored in the HACMP Configuration Database object classes for the cluster. This file is given a user-defined basename with the .odm file extension. Because the Configuration Database information must be largely the same on every cluster node, the cluster snapshot saves the values from only one node. The cluster snapshot Configuration Database data file is an ASCII text file divided into three delimited sections:

    Version section
    This section identifies the version of the cluster snapshot. The characters <VER identify the start of this section; the characters </VER identify the end of this section. The cluster snapshot software sets the version number.
    Description section
    This section contains user-defined text that describes the cluster snapshot. You can specify up to 255 characters of descriptive text. The characters <DSC identify the start of this section; the characters </DSC identify the end of this section.
    ODM data section
    This section contains the HACMP Configuration Database object classes in generic AIX 5L ODM stanza format. The characters <ODM identify the start of this section; the characters </ODM identify the end of this section.

    The following is an excerpt from a sample cluster snapshot Configuration Database data file showing some of the ODM stanzas that are saved:

    <VER 
    1.0 
    </VER 
    <DSC 
    My Cluster Snapshot 
    </DSC 
    <ODM 
    HACMPcluster: 
    
    id = 1106245917
    name = "HA52_TestCluster"
    nodename = "mynode"
    sec_level = "Standard"
    sec_level_msg = ""
    sec_encryption = ""
    sec_persistent = ""
    last_node_ids = ""
    highest_node_id = 0
    last_network_ids = ""
    highest_network_id = 0
    last_site_ids = ""
    highest_site_id = 0
    handle = 1
    cluster_version = 7
    reserved1 = 0
    reserved2 = 0
    wlm_subdir = ""
    settling_time = 0
    rg_distribution_policy = "node"
    noautoverification = 0
    clvernodename = ""
    clverhour = 0
    HACMPnode: 
    
    name = "mynode"
    object = "VERBOSE_LOGGING"
    value = "high"
    . 
    . 
    </ODM 
    

    Cluster State Information File (.info)

    This file contains the output from standard AIX 5L and HACMP system management commands. This file is given the same user-defined basename with the .info file extension. If you defined custom snapshot methods, the output from them is appended to this file. The Cluster State Information file contains three sections:

    Version section
    This section identifies the version of the cluster snapshot. The characters <VER identify the start of this section; the characters </VER identify the end of this section. The cluster snapshot software sets this section.
    Description section
    This section contains user-defined text that describes the cluster snapshot. You can specify up to 255 characters of descriptive text. The characters <DSC identify the start of this section; the characters </DSC identify the end of this section.
    Command output section
    This section contains the output generated by AIX 5L and HACMP ODM commands. This section lists the commands executed and their associated output. This section is not delimited in any way.

    Checking the Logical Volume Manager

    When troubleshooting an HACMP cluster, you need to check the following LVM entities:

  • Volume groups
  • Physical volumes
  • Logical volumes
  • Filesystems.
    Checking Volume Group Definitions

    Check to make sure that all shared volume groups in the cluster are active on the correct node. If a volume group is not active, vary it on using the appropriate command for your configuration.
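
    For example, assuming a non-concurrent shared volume group named datavg (the name used in the lsvg examples later in this section) that should be online on this node, you could vary it on and confirm its state as follows; for enhanced concurrent volume groups, use C-SPOC or the varyon options appropriate to your configuration:

    varyonvg datavg
    lsvg -o | grep datavg    # confirm the volume group is now active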

    In the SMIT panel Initialization and Standard Configuration > Configure HACMP Resource Groups > Change/Show Resources for a Resource Group (standard), all volume groups listed in the Volume Groups field for a resource group should be varied on the node(s) that have the resource group online.

    Using the lsvg Command to Check Volume Groups

    To check for inconsistencies among volume group definitions on cluster nodes, use the lsvg command to display information about the volume groups defined on each node in the cluster:

    lsvg 
    

    The system returns volume group information similar to the following:

    rootvg  
    datavg 
    

    To list only the active (varied on) volume groups in the system, use the lsvg -o command as follows:

    lsvg -o 
    

    The system returns volume group information similar to the following:

    rootvg 
    

    To list all logical volumes in the volume group, and to check the volume group status and attributes, use the lsvg -l command and specify the volume group name as shown in the following example:

    lsvg -l rootvg 
    
    Note: The volume group must be varied on to use the lsvg -l command.

    You can also use HACMP SMIT to check for inconsistencies: select the System Management (C-SPOC) > HACMP Logical Volume Management > Shared Volume Groups option to display information about shared volume groups in your cluster.

    Checking the Varyon State of a Volume Group

    You may check the status of the volume group by issuing the lsvg <vgname> command. Depending on your configuration, the lsvg output indicates the following:

  • vg state could be active (if it is active varyon), or passive only (if it is passive varyon).
  • vg mode could be concurrent or enhanced concurrent.
    Here is an example of lsvg output:

    # lsvg myvg 
    VOLUME GROUP:   Volume_Group_01          VG IDENTIFIER:  0002231b00004c00000000f2801b1cc3 
    VG STATE:       active                   PP SIZE:        16 megabyte(s) 
    VG PERMISSION:  read/write               TOTAL PPs:      1084 (17344 megabytes) 
    MAX LVs:        256                      FREE PPs:       977 (15632 megabytes) 
    LVs:            4                        USED PPs:       107 (1712 megabytes) 
    OPEN LVs:       0                        QUORUM:         2 
    TOTAL PVs:      2                        VG DESCRIPTORS: 3 
    STALE PVs:      0                        STALE PPs:      0 
    ACTIVE PVs:     2                        AUTO ON:        no 
    MAX PPs per PV: 1016                     MAX PVs:        32 
    LTG size:       128 kilobyte(s)          AUTO SYNC:      no 
    HOT SPARE:      no 
    

    Using the C-SPOC Utility to Check Shared Volume Groups

    To check for inconsistencies among volume group definitions on cluster nodes in a two-node C-SPOC environment:

      1. Enter smitty hacmp
      2. In SMIT, select System Management (C-SPOC) > HACMP Logical Volume Management > Shared Volume Groups > List All Shared Volume Groups and press Enter to accept the default (no).

    A list of all shared volume groups in the C-SPOC environment appears. This list also contains enhanced concurrent volume groups included as resources in non-concurrent resource groups.

    You can also use the C-SPOC cl_lsvg command from the command line to display this information.

    Checking Physical Volumes

    To check for discrepancies in the physical volumes defined on each node, obtain a list of all physical volumes known to the systems and compare this list against the list of disks specified in the Disks field of the Command Status panel. Access the Command Status panel through the SMIT Extended Configuration > Extended Resource Configuration > HACMP Extended Resource Group Configuration > Show All Resources by Node or Resource Group panel.

    To obtain a list of all the physical volumes known to a node and to find out the volume groups to which they belong, use the lspv command. If you do not specify the name of a volume group as an argument, the lspv command displays every known physical volume in the system. For example:

    lspv 
    hdisk0      0000914312e971a   rootvg 
    hdisk1      00000132a78e213   rootvg 
    hdisk2      00000902a78e21a   datavg 
    hdisk3      00000321358e354   datavg 
    

    The first column of the display shows the logical name of the disk. The second column lists the physical volume identifier of the disk. The third column lists the volume group (if any) to which it belongs.

    Note that on each cluster node, AIX 5L can assign different names (hdisk numbers) to the same physical volume. To tell which names correspond to the same physical volume, compare the physical volume identifiers listed on each node.

    If you specify the logical device name of a physical volume (hdiskx) as an argument to the lspv command, it displays information about the physical volume, including whether it is active (varied on). For example:

    lspv hdisk2 
    PHYSICAL VOLUME:   hdisk2              VOLUME GROUP:   abalonevg 
    PV IDENTIFIER:     0000301919439ba5    VG IDENTIFIER: 00003019460f63c7 
    PV STATE:          active              VG STATE:      active/complete 
    STALE PARTITIONS:  0                   ALLOCATABLE:       yes 
    PP SIZE:           4 megabyte(s)       LOGICAL VOLUMES:   2 
    TOTAL PPs:         203 (812 megabytes) VG DESCRIPTORS:    2 
    FREE PPs:          192 (768 megabytes)                   
    USED PPs:          11 (44 megabytes)                     
    FREE DISTRIBUTION: 41..30..40..40..41                     
    USED DISTRIBUTION: 	 00..11..00..00..00          
    

    If a physical volume is inactive (not varied on, as indicated by question marks in the PV STATE field), use the appropriate command for your configuration to vary on the volume group containing the physical volume. Before doing so, however, you may want to check the system error report to determine whether a disk problem exists. Enter the following command to check the system error report:

    errpt -a|more 
    

    You can also use the lsdev command to check the availability or status of all physical volumes known to the system.

    Checking Logical Volumes

    To check the state of logical volumes defined on the physical volumes, use the lspv -l command and specify the logical name of the disk to be checked. As shown in the following example, you can use this command to determine the names of the logical volumes defined on a physical volume:

    lspv -l hdisk2  
    LV NAME		LPs	PPs	DISTRIBUTION			MOUNT POINT 
    lv02		50	50	25..00..00..00..25			/usr 
    lv04		44	44	06..00..00..32..06			/clusterfs 
    

    Use the lslv logicalvolume command to display information about the state (opened or closed) of a specific logical volume, as indicated in the LV STATE field. For example:

    lslv nodeAlv 
    LOGICAL VOLUME: nodeAlv                VOLUME GROUP:   nodeAvg 
    LV IDENTIFIER:  00003019460f63c7.1     PERMISSION:     read/write 
    VG STATE:       active/complete        LV STATE:       opened/syncd 
    TYPE:           jfs                    WRITE VERIFY:   off 
    MAX LPs:        128                    PP SIZE:        4 megabyte(s) 
    COPIES:         1                      SCHED POLICY:   parallel 
    LPs:            10                     PPs:            10 
    STALE PPs:      0                      BB POLICY:      relocatable 
    INTER-POLICY:   minimum                RELOCATABLE:    yes 
    INTRA-POLICY:   middle                 UPPER BOUND:    32 
    MOUNT POINT:    /nodeAfs               LABEL:          /nodeAfs 
    MIRROR WRITE CONSISTENCY: on 
    EACH LP COPY ON A SEPARATE PV ?: yes 
    

    If a logical volume state is inactive (or closed, as indicated in the LV STATE field), use the appropriate command for your configuration to vary on the volume group containing the logical volume.

    Using the C-SPOC Utility to Check Shared Logical Volumes

    To check the state of shared logical volumes on cluster nodes:

    In SMIT select System Management (C-SPOC) > HACMP Logical Volume Management > Shared Logical Volumes > List All Shared Logical Volumes by Volume Group. A list of all shared logical volumes appears.

    You can also use the C-SPOC cl_lslv command from the command line to display this information.

    Checking Filesystems

    Check to see if the necessary filesystems are mounted and where they are mounted. Compare this information against the HACMP definitions for any differences. Check the permissions of the filesystems and the amount of space available on a filesystem.

    Use the following commands to obtain this information about filesystems:

  • The mount command
  • The df command
  • The lsfs command.
    Use the cl_lsfs command to list filesystem information when running the C-SPOC utility.

    Obtaining a List of Filesystems

    Use the mount command to list all the filesystems, both JFS and NFS, currently mounted on a system and their mount points. For example:

    mount 
    node   mounted       mounted over   vfs   date           options 
    --------------------------------------------------------------------- 
           /dev/hd4      /              jfs   Oct 06 09:48   rw,log=/dev/hd8 
           /dev/hd2      /usr           jfs   Oct 06 09:48   rw,log=/dev/hd8 
           /dev/hd9var   /var           jfs   Oct 06 09:48   rw,log=/dev/hd8 
           /dev/hd3      /tmp           jfs   Oct 06 09:49   rw,log=/dev/hd8 
           /dev/hd1      /home          jfs   Oct 06 09:50   rw,log=/dev/hd8 
    pearl  /home         /home          nfs   Oct 07 09:59   rw,soft,bg,intr 
    jade   /usr/local    /usr/local     nfs   Oct 07 09:59   rw,soft,bg,intr 
    

    Determine whether and where the filesystem is mounted, then compare this information against the HACMP definitions to note any differences.

    Checking Available Filesystem Space

    To see the space available on a filesystem, use the df command. For example:

    df 
    Filesystem    Total KB    free %used   iused %iused 	Mounted on 
    /dev/hd4         12288    5308   56%     896    21% 	/ 
    /dev/hd2        413696   26768   93%   19179    18% 	/usr 
    /dev/hd9var       8192    3736   54%     115     5% 	/var 
    /dev/hd3          8192    7576    7%      72     3% 	/tmp 
    /dev/hd1          4096    3932    4%      17     1% 	/home 
    /dev/crab1lv      8192    7904    3%      17     0% 	/crab1fs 
    /dev/crab3lv     12288   11744    4%      16     0% 	/crab3fs 
    /dev/crab4lv     16384   15156    7%      17     0% 	/crab4fs 
    /dev/crablv       4096    3252   20%      17     1% 	/crabfs 
    

    Check the %used column for filesystems that are using more than 90% of their available space. Then check the free column to determine the exact amount of free space left.
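
    A quick way to flag nearly full filesystems is to filter the df output. This sketch assumes the column layout shown above, with %used in the fourth column and the mount point in the last; adjust the field numbers if your df output differs:

    df | awk 'NR > 1 && int($4) >= 90 { print $7, $4 " used,", $3 " KB free" }'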

    Checking Mount Points, Permissions, and Filesystem Information

    Use the lsfs command to display information about mount points, permissions, filesystem size and so on. For example:

    lsfs 
    Name           Nodename   Mount Pt   VFS   Size     Options   Auto 
    /dev/hd4       --         /          jfs   24576    --        yes 
    /dev/hd1       --         /home      jfs   8192     --        yes 
    /dev/hd2       --         /usr       jfs   827392   --        yes 
    /dev/hd9var    --         /var       jfs   16384    --        yes 
    /dev/hd3       --         /tmp       jfs   16384    --        yes 
    /dev/hd7       --         /mnt       jfs   --       --        no 
    /dev/hd5       --         /blv       jfs   --       --        no 
    /dev/crab1lv   --         /crab1fs   jfs   16384    rw        no 
    /dev/crab3lv   --         /crab3fs   jfs   24576    rw        no 
    /dev/crab4lv   --         /crab4fs   jfs   32768    rw        no 
    /dev/crablv    --         /crabfs    jfs   8192     rw        no 
    

    Important: For filesystems to be NFS exported, be sure to verify that logical volume names for these filesystems are consistent throughout the cluster.

    Using the C-SPOC Utility to Check Shared Filesystems

    To check whether the necessary shared filesystems are mounted, and where they are mounted, on cluster nodes in a two-node C-SPOC environment:

    In SMIT select System Management (C-SPOC) > HACMP Logical Volume Management > Shared Filesystems. Select from either Journaled Filesystems > List All Shared Filesystems or Enhanced Journaled Filesystems > List All Shared Filesystems to display a list of shared filesystems.

    You can also use the C-SPOC cl_lsfs command from the command line to display this information.

    Checking the Automount Attribute of Filesystems

    At boot time, AIX 5L attempts to check all the filesystems listed in /etc/filesystems with the check=true attribute by running the fsck command. If AIX 5L cannot check a filesystem, it reports the following error:

    Filesystem helper: 0506-519 Device open failed 
    

    For filesystems controlled by HACMP, this error message typically does not indicate a problem. The filesystem check fails because the volume group on which the filesystem is defined is not varied on at boot time.

    To avoid generating this message, edit the /etc/filesystems file to ensure that the stanzas for the shared filesystems do not include the check=true attribute.
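
    For illustration, a stanza for the HACMP-controlled filesystem /crab1fs (shown in the df example above) might look like the following; the key point is that check is not set to true and mount is false, so AIX 5L neither checks nor mounts the filesystem at boot time. The jfslog device name is a placeholder:

    * the log device name below is a placeholder; substitute your actual jfslog
    /crab1fs:
            dev             = /dev/crab1lv
            vfs             = jfs
            log             = /dev/loglv00
            mount           = false
            check           = false
            options         = rw
            account         = false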

    Checking the TCP/IP Subsystem

    To investigate the TCP/IP subsystem, use the following AIX 5L commands:

  • Use the netstat command to make sure that the network interfaces are initialized and that a communication path exists between the local node and the target node.
  • Use the ping command to check the point-to-point connectivity between nodes.
  • Use the ifconfig command on all network interfaces to detect bad IP addresses, incorrect subnet masks, and improper broadcast addresses.
  • Scan the /tmp/hacmp.out file to confirm that the /etc/rc.net script has run successfully. Look for a zero exit status.
  • If IP address takeover is enabled, confirm that the /etc/rc.net script has run and that the service interface is on its service address and not on its base (boot) address.
  • Use the lssrc -g tcpip command to make sure that the inetd daemon is running.
  • Use the lssrc -g portmap command to make sure that the portmapper daemon is running.
  • Use the arp command to make sure that the cluster nodes are not using the same IP or hardware address.
  • Use the netstat command to:
  • Show the status of the network interfaces defined for a node.
  • Determine whether a route from the local node to the target node is defined.
    The netstat -in command displays a list of all initialized interfaces for the node, along with the network to which that interface connects and its IP address. You can use this command to determine whether the service and standby interfaces are on separate subnets. The subnets are displayed in the Network column.

    netstat -in 
    Name    Mtu    Network       Address         Ipkts   Ierrs  Opkts   Oerrs   Coll 
    lo0    1536    <Link>                        18406     0    18406     0      0 
    lo0    1536    127           127.0.0.1       18406     0    18406     0      0 
    en1    1500    <Link>                        1111626   0    58643     0      0 
    en1    1500    100.100.86.   100.100.86.136  1111626   0    58643     0      0 
    en0    1500    <Link>                        943656    0    52208     0      0 
    en0    1500    100.100.83.   100.100.83.136  943656    0    52208     0      0 
    tr1    1492    <Link>                        1879      0    1656      0      0 
    tr1    1492    100.100.84.   100.100.84.136  1879      0    1656      0      0 
    

    Look at the first, third, and fourth columns of the output. The Name column lists all the interfaces defined and available on this node. Note that an asterisk preceding a name indicates the interface is down (not ready for use). The Network column identifies the network to which the interface is connected (its subnet). The Address column identifies the IP address assigned to the node.

    The netstat -rn command indicates whether a route to the target node is defined. To see all the defined routes, enter:

    netstat -rn 
    

    Information similar to that shown in the following example is displayed:

    Routing tables 
    Destination      Gateway            Flags  Refcnt Use       Interface 
    Netmasks: 
    (root node) 
    (0)0                                          
    (0)0 ff00 0                                   
    (0)0 ffff 0                                   
    (0)0 ffff ff80 0                              
    (0)0 70 204 1 0                               
    (root node)Route Tree for Protocol Family 2: 
    (root node) 
    127              127.0.0.1          U           3     1436  lo0 
    127.0.0.1        127.0.0.1          UH          0      456  lo0 
    100.100.83.128   100.100.83.136     U           6    18243  en0 
    100.100.84.128   100.100.84.136     U           1     1718  tr1 
    100.100.85.128   100.100.85.136     U           2     1721  tr0 
    100.100.86.128   100.100.86.136     U           8    21648  en1 
    100.100.100.128  100.100.100.136    U           0       39  en0 
    (root node)Route Tree for Protocol Family 6: 
    (root node) 
    (root node) 
    

    To test for a specific route to a network (for example 100.100.83), enter:

    netstat -nr | grep '100\.100\.83' 
    100.100.83.128   100.100.83.136     U           6    18243  en0 
    

    The same test, run on a system that does not have this route in its routing table, returns no response. If the service and standby interfaces are separated by a bridge, router, or hub and you experience problems communicating with network devices, the devices may not be set to handle two network segments as one physical network. Try testing the devices independent of the configuration, or contact your system administrator for assistance.

    Note that if you have only one interface active on a network, the Cluster Manager will not generate a failure event for that interface. For more information, see the section on network interface events in the Planning Guide.

    See the netstat man page for more information on using this command.

    Checking Point-to-Point Connectivity

    The ping command tests the point-to-point connectivity between two nodes in a cluster. Use the ping command to determine whether the target node is attached to the network and whether the network connections between the nodes are reliable. Be sure to test all TCP/IP interfaces configured on the nodes (service and standby).

    For example, to test the connection from a local node to a remote node named nodeA enter:

    /etc/ping nodeA 
    PING testcluster.nodeA.com: (100.100.81.141): 56 data bytes 
    64 bytes from 100.100.81.141: icmp_seq=0 ttl=255 time=2 ms 
    64 bytes from 100.100.81.141: icmp_seq=1 ttl=255 time=1 ms 
    64 bytes from 100.100.81.141: icmp_seq=2 ttl=255 time=2 ms 
    64 bytes from 100.100.81.141: icmp_seq=3 ttl=255 time=2 ms 
    

    Type Control-C to end the display of packets. The following statistics appear:

    ----testcluster.nodeA.com PING Statistics---- 
    4 packets transmitted, 4 packets received, 0% packet loss 
    round-trip min/avg/max = 1/1/2 ms 
    

    The ping command sends packets to the specified node, requesting a response. If a correct response arrives, ping prints a message similar to the output shown above indicating no lost packets. This indicates a valid connection between the nodes.

    If the ping command hangs, it indicates that there is no valid path between the node issuing the ping command and the node you are trying to reach. It could also indicate that required TCP/IP daemons are not running. Check the physical connection between the two nodes. Use the ifconfig and netstat commands to check the configuration. A “bad value” message indicates problems with the IP addresses or subnet definitions.

    Note that if “DUP!” appears at the end of the ping response, it means the ping command has received multiple responses for the same address. This response typically occurs when network interfaces have been misconfigured, or when a cluster event fails during IP address takeover. Check the configuration of all interfaces on the subnet to verify that there is only one interface per address. For more information, see the ping man page.

    In addition, you can assign a persistent node IP label to a cluster network on a node. A persistent node IP label is convenient when, for administrative purposes, you want to reach a specific node in the cluster with the ping or telnet commands without having to know whether the service IP label you would otherwise use currently belongs to a resource group on that node.

    For more information on how to assign persistent Node IP labels on the network on the nodes in your cluster, see the Planning Guide and Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.

    Checking the IP Address and Netmask

    Use the ifconfig command to confirm that the IP address and netmask are correct. Invoke ifconfig with the name of the network interface that you want to examine. For example, to check the first Ethernet interface, enter:

    ifconfig en0 
    en0: flags=2000063<UP,BROADCAST,NOTRAILERS,RUNNING,NOECHO> 
    	inet 100.100.83.136 netmask 0xffffff00 broadcast 100.100.83.255 
    

    If the specified interface does not exist, ifconfig replies:

    No such device 
    

    The ifconfig command displays two lines of output. The first line shows the interface’s name and characteristics. Check for these characteristics:

    UP
    The interface is ready for use. If the interface is down, use the ifconfig command to initialize it. For example:
    ifconfig en0 up
    If the interface does not come up, replace the interface cable and try again. If it still fails, use the diag command to check the device.
    RUNNING
    The interface is working. If the interface is not running, the driver for this interface may not be properly installed, or the interface is not properly configured. Review all the steps necessary to install this interface, looking for errors or missed steps.

    The second line of output shows the IP address and the subnet mask (written in hexadecimal). Check these fields to make sure the network interface is properly configured.

    See the ifconfig man page for more information.

    Using the arp Command

    Use the arp command to view the IP and hardware addresses that a host’s arp cache currently associates with other nodes. For example:

    arp -a 
    flounder (100.50.81.133) at 8:0:4c:0:12:34 [ethernet] 
    cod (100.50.81.195) at 8:0:5a:7a:2c:85 [ethernet] 
    seahorse (100.50.161.6) at 42:c:2:4:0:0 [token ring] 
    pollock (100.50.81.147) at 10:0:5a:5c:36:b9 [ethernet] 
    

    This output shows what the host node currently believes to be the IP and MAC addresses for nodes flounder, cod, seahorse and pollock. (If IP address takeover occurs without Hardware Address Takeover, the MAC address associated with the IP address in the host’s arp cache may become outdated. You can correct this situation by refreshing the host’s arp cache.)
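
    For example, to refresh a stale entry for a node after an IP address takeover, you can delete the entry and let the next ping repopulate it (nodeA is a placeholder hostname):

    arp -d nodeA           # delete the stale entry from the arp cache
    ping -c 1 nodeA        # send one packet to repopulate the entry
    arp -a | grep nodeA    # confirm the refreshed hardware address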

    See the arp man page for more information.

    Checking Heartbeating over IP Aliases

    The hacmp.out file shows when a heartbeating over IP Aliases address is removed from an interface, and when it is added to the interface again, during an adapter_swap.

    Use the following to check the configuration for heartbeating over IP Aliases:

  • netstat -n shows the aliases
  • clstrmgr.debug shows an IP Alias address when it is mapped to an interface.
    Checking ATM Classic IP Hardware Addresses

    For Classic IP interfaces, the arp command is particularly useful to diagnose errors. It can be used to verify the functionality of the ATM network on the ATM protocol layer, and to verify the registration of each Classic IP client with its server.

    Example 1

    The following arp command yields the output below:

    arp -t atm -a 
    SVC - at0 on device atm2  
    ========================= 
    at0(10.50.111.4) 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.a6.9b.0 
    IP Addr 		VPI:VCI Handle ATM Address 
    stby_1A(10.50.111.2) 								 
    	0:110 21 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.82.48.7 
    server_10_50_111(10.50.111.99)  
    	0:103 14 39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a.11.0 
    stby_1C(10.50.111.6)     
    	0:372   11  39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.98.fc.0 
    SVC - at2 on device atm1  
    ======================== 
    at2(10.50.110.4) 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.83.63.2 
    IP Addr		VPI:VCI Handle ATM Address 
    boot_1A(10.50.110.2)     
    	0:175   37 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.9e.2d.2 
    server_10_50_110(10.50.110.99)          
    	0:172   34  39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a.10.0 
    boot_1C(10.50.110.6)     
    	0:633   20 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.99.c1.3 
    

    The ATM devices atm1 and atm2 have connected to the ATM switch and retrieved its address, 39.99.99.99.99.99.99.0.0.99.99.1.1. This address appears in the first 13 bytes of the addresses of the two clients, at0 and at2. The clients have successfully registered with their corresponding Classic IP servers: server_10_50_111 for at0 and server_10_50_110 for at2. The two clients are able to communicate with other clients on the same subnet. (The clients for at0, for example, are stby_1A and stby_1C.)

    Example 2

    If the connection between an ATM device and the switch is not functional on the ATM layer, the output of the arp command looks as follows:

    arp -t atm -a 
    SVC - at0 on device atm2  
    ========================== 
    at0(10.50.111.4) 8.0.5a.99.a6.9b.0.0.0.0.0.0.0.0.0.0.0.0.0.0 
    

    Here the MAC address of ATM device atm2, 8.0.5a.99.a6.9b, appears as the first six bytes of the ATM address for interface at0. The ATM device atm2 has not registered with the switch, since the switch address does not appear as the first part of the ATM address of at0.

    Checking the AIX 5L Operating System

    To view hardware and software errors that may affect the cluster, use the errpt command. Be on the lookout for disk and network error messages, especially permanent ones, which indicate real failures. See the errpt man page for more information.
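
    For example, to narrow the report to permanent hardware errors, the class most likely to indicate a failed disk or network adapter, filter by error class and type:

    errpt -d H -T PERM | more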

    Checking Physical Networks

    Checkpoints for investigating physical connections include:

  • Check the serial line between each pair of nodes.
  • If you are using Ethernet:
  • Use the diag command to verify that the network interface card is good.
  • Ethernet adapters for the IBM eServer pSeries can be used either with the transceiver that is on the card or with an external transceiver. There is a jumper on the NIC to specify which you are using. Verify that your jumper is set correctly.
  • Make sure that hub lights are on for every connected cable.
  • If you are using Token-Ring:
  • Use the diag command to verify that the NIC and cables are good.
  • Make sure that all the nodes in the cluster are on the same ring.
  • Make sure that the ringspeed is set to the same value for all NICs.
    To review HACMP network requirements, see Chapter 3: Planning Cluster Network Connectivity in the Planning Guide.

    Checking Disks, Disk Adapters, and Disk Heartbeating Networks

    Use the diag command to verify that the adapter card is functioning properly. If problems arise, be sure to check the jumpers, cables, and terminators along the SCSI bus.

    For SCSI disks, including IBM SCSI disks and arrays, make sure that each array controller, adapter, and physical disk on the SCSI bus has a unique SCSI ID. Each SCSI ID on the bus must be an integer value from 0 through 15, although some SCSI adapters may have limitations on the SCSI ID that can be set. See the device documentation for information about any device-specific limitations. A common configuration is to set the SCSI ID of the adapters on the nodes to be higher than the SCSI IDs of the shared devices. Devices with higher IDs take precedence in SCSI bus contention.

    For example, if the standard SCSI adapters use IDs 5 and 6, assign values from 0 through 4 to the other devices on the bus. You may want to set the SCSI IDs of the adapters to 5 and 6 to avoid a possible conflict when booting one of the systems in service mode from a mksysb tape or other boot device, since this will always use an ID of 7 as the default.

    If the SCSI adapters use IDs of 14 and 15, assign values from 3 through 13 to the other devices on the bus. Refer to your worksheet for the values previously assigned to the adapters.

    You can check the SCSI IDs of adapters and disks using either the lsattr or lsdev command. For example, to determine the SCSI ID of the adapter scsi1 (SCSI-3), use the following lsattr command and specify the logical name of the adapter as an argument:

    lsattr -E -l scsi1 | grep id  
    

    Do not use wildcard characters or full pathnames on the command line for the device name designation.

    Important: If you restore a backup of your cluster configuration onto an existing system, be sure to recheck or reset the SCSI IDs to avoid possible SCSI ID conflicts on the shared bus. Restoring a system backup causes adapter SCSI IDs to be reset to the default SCSI ID of 7.

    If you note a SCSI ID conflict, see the Planning Guide for information about setting the SCSI IDs on disks and disk adapters.

    To determine the SCSI ID of a disk, enter:

    lsdev -Cc disk -H 
    

    Recovering from PCI Hot Plug NIC Failure

    If an unrecoverable error causes a PCI hot-replacement process to fail, you may be left in a state where your NIC is unconfigured and still in maintenance mode. The PCI slot holding the card and/or the new card may be damaged at this point. User intervention is required to get the node back in fully working order.

    For more information, refer to your hardware manuals or search for information about devices on IBM’s website.

    Checking Disk Heartbeating Networks

    Cluster verification confirms whether a disk heartbeating network is correctly configured. RSCT logs provide information for disk heartbeating networks that is similar to the information provided for other types of networks.

    Use the following commands to test connectivity for a disk heartbeating network:

  • dhb_read tests connectivity for a disk heartbeating network.
  • For information about dhb_read, see the RSCT Command for Testing Disk Heartbeating section in Appendix C: HACMP for AIX 5L Commands in the Administration Guide.
  • clip_config provides information about devices discovered for disk heartbeating.
  • lssrc -ls topsvcs shows network activity.
    Testing a Disk Heartbeating Network

    The first step in troubleshooting a disk heartbeating network is to test the connections. As with RS232 networks, a disk heartbeating network cannot be tested while the network is active.

    To use dhb_read to test a disk heartbeating connection:

      1. Set one node to run the command in receive mode:
    dhb_read -p hdisk# -r
    where hdisk# identifies the hdisk in the network, such as hdisk1.
      2. Set the other node to run the command in transmit mode:
    dhb_read -p hdisk# -t
    where hdisk# identifies the hdisk in the network, such as hdisk1.
    The hdisk# is the same on both nodes.
    The following message indicates that the communication path is operational:
    Link operating normally.

    If a device that is expected to appear in a picklist does not, view the clip_config file to see what information was discovered.

    $ cat /usr/es/sbin/cluster/etc/config/clip_config | grep diskhb 
    nodeA:15#Serial#(none)#0#/0#0##0#0.0.0.0#hdisk1#hdisk1# 
    DE:AD:BE:EF#(none)##diskhb#public#0#0002409f07346b43 
    nodeB:15#Serial#(none)#0#/0#0##0#0.0.0.0#hdisk1#hdisk1# 
    DE:AD:BE:EF#(none)##diskhb#public#0#0002409f07346b43 
    

    Disk Heartbeating Networks and Network Failure Detection

    Disk heartbeating networks are identical to other non-IP based networks in terms of the operation of the failure detection rate. However, there is a subtle difference that affects the state of the network endpoints and the events run.

    Disk heartbeating networks work by exchanging heartbeat messages on a reserved portion of a shared disk. As long as the node can access the disk, the network endpoint will be considered up, even if heartbeat messages are not being sent between nodes. The disk heartbeating network itself will still be considered down.

    All other non-IP networks mark the network and both endpoints as down when either endpoint fails. This difference makes it easier to diagnose problems with disk heartbeating networks: if the problem is in the connection between just one node and the shared disk, only that part of the network will be marked as down.

    Disk Heartbeating Networks and Fast Node Failure Detection

    HACMP 5.4 provides a method to reduce the time it takes for a node failure to be realized throughout the cluster, while reliably detecting node failures.

    HACMP 5.4 uses disk heartbeating to put a departing message on a shared disk so its neighbor(s) will be immediately aware of the node failure (without waiting for missed heartbeats). Topology Services will then distribute the information about the node failure throughout the cluster and then each Topology Services daemon sends a node_down event to any concerned client.

    For more information see the section Decreasing Node Fallover Time in Chapter 3: Planning Cluster Network Connectivity in the Planning Guide.

    Disk Heartbeating Networks and Failed Disk Enclosures

    In addition to providing a non-IP network to help ensure high availability, you can use disk heartbeating networks to detect failure of a disk enclosure (cabinet). To use this function, configure a disk heartbeating network for at least one disk in each disk enclosure.

    To configure a disk heartbeating network to detect a failure of a disk enclosure:

      1. Configure a disk heartbeating network for a disk in the specified enclosure. For information about configuring a disk heartbeating network, see the section Configuring Heartbeating over Disk in the Administration Guide.
      2. Create a pre- or post-event, or a notification method, to determine the action to be taken in response to a failure of the disk heartbeating network. (A failure of the disk enclosure would be seen as a failure of the disk heartbeating network.)

    Checking the Cluster Communications Daemon

    In some cases, if you change or remove IP addresses in the AIX 5L adapter configuration after the cluster has been synchronized, the Cluster Communications daemon cannot validate these addresses against the /usr/es/sbin/cluster/etc/rhosts file or against the entries in the HACMP Configuration Database, and HACMP issues an error.

    Or, you may obtain an error during the cluster synchronization.

    In this case, you must update the information that is saved in the /usr/es/sbin/cluster/etc/rhosts file on all cluster nodes, and refresh clcomd to make it aware of the changes. When you synchronize and verify the cluster again, clcomd starts using IP addresses added to HACMP Configuration Database.

    To refresh the Cluster Communications daemon, use:

    refresh -s clcomdES

    Also, configure the /usr/es/sbin/cluster/etc/rhosts file to contain all the addresses currently used by HACMP for inter-node communication, and then copy this file to all cluster nodes.
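
    As an illustration, the rhosts file is simply a list of addresses, one per line; assuming the base addresses shown in the netstat examples earlier in this chapter (your addresses will differ), it might look like this before you refresh clcomdES as shown above:

    cat /usr/es/sbin/cluster/etc/rhosts 
    100.100.83.136
    100.100.84.136
    100.100.86.136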

    For troubleshooting other related problems, also see Cluster Communications Issues in this chapter.

    Checking System Hardware

    Check the power supplies and LED displays to see if any error codes are displayed. Run the AIX 5L diag command to test the system unit.

    Without an argument, diag runs as a menu-driven program. You can also run diag on a specific piece of hardware. For example:

    diag -d hdisk0 -c 
    Starting diagnostics. 
    Ending diagnostics. 
    

    This output indicates that hdisk0 is okay.

    HACMP Installation Issues

    The following potential installation issues are described here:

  • Cannot Find Filesystem at Boot Time
  • cl_convert Does Not Run Due to Failed Installation
  • Configuration Files Could Not Be Merged during Installation.
    Cannot Find Filesystem at Boot Time

    Problem

    At boot time, AIX 5L tries to check, by running the fsck command, all the filesystems listed in /etc/filesystems with the check=true attribute. If it cannot check a filesystem, AIX 5L reports the following error:

    +----------------------------------------------------------+ 
      Filesystem Helper: 0506-519 Device open failed 
    +----------------------------------------------------------+ 
    

    Solution

    For filesystems controlled by HACMP, this error typically does not indicate a problem. The filesystem check failed because the volume group on which the filesystem is defined is not varied on at boot-time. To prevent the generation of this message, edit the /etc/filesystems file to ensure that the stanzas for the shared filesystems do not include the check=true attribute.

    cl_convert Does Not Run Due to Failed Installation

    Problem

    When you install HACMP, cl_convert is run automatically. The software checks for an existing HACMP configuration and attempts to convert that configuration to the format used by the version of the software being installed. However, if the installation fails, cl_convert will fail to run as a result. Therefore, conversion from the Configuration Database of a previous HACMP version to the Configuration Database of the current version will also fail.

    Solution

    Run cl_convert from the command line. To gauge conversion success, refer to the /tmp/clconvert.log file, which logs conversion progress.

    Root user privilege is required to run cl_convert.

    Warning: Before converting to HACMP 5.4, be sure that your ODMDIR environment variable is set to /etc/es/objrepos.

    For information on cl_convert flags, refer to the cl_convert man page.
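
    A minimal sketch of a manual run follows. The cl_convert path and the flags you need depend on your installation, so treat the command lines below as assumptions and confirm them against the cl_convert man page:

    export ODMDIR=/etc/es/objrepos                # see the Warning above
    /usr/es/sbin/cluster/conversion/cl_convert    # path assumed; add flags per the man page
    tail /tmp/clconvert.log                       # review conversion progress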

    Configuration Files Could Not Be Merged during Installation

    Problem

    During the installation of HACMP client software, the following message appears:

        +----------------------------------------------------------+ 
                     Post-installation Processing... 
      +----------------------------------------------------------+ 
    Some configuration files could not be automatically merged into the 
    system during the installation. The previous versions of these files 
    have been saved in a configuration directory as listed below. Compare 
    the saved files and the newly installed files to determine if you need 
    to recover configuration data. Consult product documentation to 
    determine how to merge the data. 
    Configuration files, which were saved in /usr/lpp/save.config:       
    /usr/es/sbin/cluster/utilities/clexit.rc 
    

    Solution

    As part of the HACMP installation process, copies of HACMP files that could potentially contain site-specific modifications are saved in the /usr/lpp/save.config directory before they are overwritten. As the message states, you must merge site-specific configuration information into the newly installed files.
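
    For example, assuming the saved copies mirror the full original path under /usr/lpp/save.config, you might compare the file named in the message as follows:

    diff /usr/lpp/save.config/usr/es/sbin/cluster/utilities/clexit.rc \
         /usr/es/sbin/cluster/utilities/clexit.rc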

    HACMP Startup Issues

    The following potential HACMP startup issues are described here:

  • ODMPATH Environment Variable Not Set Correctly
  • clinfo Daemon Exits after Starting
  • Node Powers Down; Cluster Manager Will Not Start
  • configchk Command Returns an Unknown Host Message
  • Cluster Manager Hangs during Reconfiguration
  • clcomdES and clstrmgrES Fail to Start on Newly installed AIX 5L Nodes
  • Pre- or Post-Event Does Not Exist on a Node after Upgrade
  • Node Fails During Configuration with “869” LED Display
  • Node Cannot Rejoin Cluster after Being Dynamically Removed
  • Resource Group Migration Is Not Persistent after Cluster Startup.
  • SP Cluster Does Not Startup after Upgrade to HACMP 5.4.
    ODMPATH Environment Variable Not Set Correctly

    Problem

    Queried object not found.

    Solution

    HACMP has a dependency on the location of certain ODM repositories to store configuration data. The ODMPATH environment variable allows ODM commands and subroutines to query locations other than the default location if the queried object does not reside in the default location. You can set this variable, but it must include the default location, /etc/objrepos, or the integrity of configuration information may be lost.
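
    For example, a sketch of a valid setting that includes the default location (the second directory is only an illustration):

    export ODMPATH=/etc/objrepos:/etc/es/objrepos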

    clinfo Daemon Exits after Starting

    Problem

    The “smux-connect” error occurs after starting the clinfoES daemon with the -a option. Another process is using port 162 to receive traps.

    Solution

    Check to see if another process, such as the trapgend smux subagent of NetView for AIX 5L or the System Monitor for AIX 5L sysmond daemon, is using port 162. If so, restart clinfoES without the -a option and configure NetView for AIX 5L to receive the SNMP traps. Note that you will not experience this error if clinfoES is started in its normal way using the startsrc command.
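
    For example, to restart clinfoES under SRC control without the -a option:

    stopsrc -s clinfoES
    startsrc -s clinfoES       # restarted without the -a option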

    Node Powers Down; Cluster Manager Will Not Start

    Problem

    The node powers itself off or appears to hang after starting the Cluster Manager. The configuration information does not appear to be identical on all nodes, causing the clexit.rc script to issue a halt -q to the system.

    Solution

    Use the cluster verification utility to uncover discrepancies in cluster configuration information on all cluster nodes.

    Correct any configuration errors uncovered by the cluster verification utility. Make the necessary changes using the HACMP Initialization and Standard Configuration or Extended Configuration SMIT panels. After correcting the problem, select the Verify and Synchronize HACMP Configuration option to synchronize the cluster resources configuration across all nodes. Then select the Start Cluster Services option from the System Management (C-SPOC) > Manage HACMP Services SMIT panel to start the Cluster Manager.

    The Cluster Manager should not exit if the configuration has passed cluster verification. If it does exit, use the AIX 5L snap -e command to collect HACMP cluster data, including the log files and open a Program Management Report (PMR) requesting performance assistance.

    For more information about the snap -e command, see the section Using the AIX Data Collection Utility, in Chapter 1: Troubleshooting HACMP Clusters.

    You can modify the file /etc/cluster/hacmp.term to change the default action after an abnormal exit. The clexit.rc script checks for the presence of this file, and if you have made it executable, the instructions there will be followed instead of the automatic halt called by clexit.rc. Please read the caveats contained in the /etc/cluster/hacmp.term file, before making any modifications. For more information, see the section Abnormal Termination of a Cluster Daemon in the Administration Guide.
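
    For example, after reviewing the caveats in the file and adding your own instructions, you would make the file executable so that clexit.rc follows it:

    vi /etc/cluster/hacmp.term         # review the caveats and edit as needed
    chmod +x /etc/cluster/hacmp.term   # clexit.rc uses this file only if it is executable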

    configchk Command Returns an Unknown Host Message

    Problem

    The /etc/hosts file on each cluster node does not contain the IP labels of other nodes in the cluster. For example, in a four-node cluster, Node A, Node B, and Node C’s /etc/hosts files do not contain the IP labels of the other cluster nodes.

    If this situation occurs, the configchk command returns the following message to the console:

    "your hostname not known," "Cannot access node x." 
    

    which indicates that the /etc/hosts file on Node x does not contain an entry for your node.

    Solution

    Before starting the HACMP software, ensure that the /etc/hosts file on each node includes the service and boot IP labels of each cluster node.
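
    As a sketch, the /etc/hosts file on every node would contain entries similar to the following; the addresses and labels are hypothetical:

    192.168.10.1    nodea_boot
    192.168.11.1    nodea_svc
    192.168.10.2    nodeb_boot
    192.168.11.2    nodeb_svc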

    Cluster Manager Hangs during Reconfiguration

    Problem

    The Cluster Manager hangs during reconfiguration and generates messages similar to the following:

    The cluster has been in reconfiguration too long;Something may be wrong. 
    

    An event script has failed.

    Solution

    Determine why the script failed by examining the /tmp/hacmp.out file to see which process exited with a non-zero status. The error messages in the /usr/adm/cluster.log file may also be helpful. Fix the problem identified in the log file. Then run the clruncmd command either at the command line, or by using the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel. The clruncmd command signals the Cluster Manager to resume cluster processing.
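
    The following commands sketch one way to locate the failure and then resume processing; the search strings are typical of these logs but not guaranteed, and the clruncmd path is an assumption:

    grep "EVENT FAILED" /tmp/hacmp.out        # find the event that exited with a non-zero status
    grep -i error /usr/adm/cluster.log        # look for related error messages
    /usr/es/sbin/cluster/utilities/clruncmd   # run only after fixing the underlying problem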

    clcomdES and clstrmgrES Fail to Start on Newly installed AIX 5L Nodes

    Problem

    On newly installed AIX 5L nodes, clcomdES and clstrmgrES fail to start.

    Solution

    Manually indicate to the system console (for the AIX installation assistant) that the AIX 5L installation is finished.

    This problem usually occurs on newly installed AIX nodes: at the first boot, AIX runs the installation assistant from /etc/inittab and does not proceed with the other entries in this file. The AIX 5L installation assistant waits for your input on the system console, and AIX 5L runs the installation assistant on every subsequent boot until you indicate that installation is finished. Once you do so, the system proceeds to start the Cluster Communications daemon (clcomdES) and the Cluster Manager daemon (clstrmgrES).

    Pre- or Post-Event Does Not Exist on a Node after Upgrade

    Problem

    The cluster verification utility indicates that a pre- or post-event does not exist on a node after upgrading to a new version of the HACMP software.

    Solution

    Ensure that a script by the defined name exists and is executable on all cluster nodes.

    Each node must contain a script associated with the defined pre- or post-event. While the contents of the script do not have to be the same on each node, the name of the script must be consistent across the cluster. If no action is desired on a particular node, a no-op script with the same event-script name should be placed on nodes on which no processing should occur.
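
    A minimal no-op event script might look like the following; the file name and location are hypothetical and must match the pre- or post-event name defined to HACMP:

    #!/bin/ksh
    # Hypothetical no-op pre-event script; no processing is required on this node.
    exit 0

    Make the script executable on every node, for example with chmod +x <script_name>.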

    Node Fails During Configuration with “869” LED Display

    Problem

    The system appears to be hung. “869” is displayed continuously on the system LED display.

    Solution

    A number of situations can cause this display to occur. Make sure all devices connected to the SCSI bus have unique SCSI IDs to avoid SCSI ID conflicts. In particular, check that the adapters and devices on each cluster node connected to the SCSI bus have a different SCSI ID. By default, AIX 5L assigns an ID of 7 to a SCSI adapter when it configures the adapter. See the Planning Guide for more information on checking and setting SCSI IDs.
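
    The following sketch shows how an adapter SCSI ID is typically examined and changed on AIX 5L; the adapter name and ID value are placeholders, and the exact attribute can vary by adapter type, so verify the procedure against the Planning Guide:

    lsattr -El scsi0 -a id       # display the adapter's current SCSI ID
    chdev -l scsi0 -a id=6 -P    # change it; -P applies the change at the next reboot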

    Node Cannot Rejoin Cluster after Being Dynamically Removed

    Problem

    A node that has been dynamically removed from a cluster cannot rejoin.

    Solution

    When you remove a node from the cluster, the cluster definition remains in the node’s Configuration Database. If you start cluster services on the removed node, the node reads this cluster configuration data and attempts to rejoin the cluster from which it had been removed. The other nodes no longer recognize this node as a member of the cluster and refuse to allow the node to join. Because the node requesting to join the cluster has the same cluster name as the existing cluster, it can cause the cluster to become unstable or crash the existing nodes.

    To ensure that a removed node cannot be restarted with outdated Configuration Database information, complete the following procedure to remove the cluster definition from the node:

      1. Stop cluster services on the node to be removed using the following command:
    clstop -R
    Warning: You must stop cluster services on the node before removing it from the cluster.
    The -R flag removes the HACMP entry in the /etc/inittab file, preventing cluster services from being automatically started when the node is rebooted.
      2. Remove the HACMP entry from the rc.net file using the following command:
    clchipat false
      3. Remove the cluster definition from the node’s Configuration Database using the
      following command:
    clrmclstr

    You can also perform this task by selecting Extended Configuration > Extended Topology Configuration > Configure an HACMP Cluster > Remove an HACMP Cluster from the SMIT panel.

    Resource Group Migration Is Not Persistent after Cluster Startup

    Problem

    You have specified a resource group migration operation using the Resource Group Migration Utility and requested that this particular migration persist across cluster reboot, either by setting the Persist across Cluster Reboot flag to true or by issuing the clRGmove -p command. After you stop and restart the cluster services, this policy is not followed on one of the nodes in the cluster.

    Solution

    This problem occurs if a node was down and inaccessible when you specified the persistent resource group migration. That node did not obtain the information about the persistent migration, and if it is the first node to join the cluster after cluster services are restarted, it has no knowledge of the Persist across Cluster Reboot setting. As a result, the resource group migration is not persistent. To restore the persistent migration setting, specify it again in SMIT under the Extended Resource Configuration > HACMP Resource Group Configuration menu.

    SP Cluster Does Not Startup after Upgrade to HACMP 5.4

    Problem

    The ODM entry for group “hacmp” is removed on SP nodes. This problem manifests itself as the inability to start the cluster or clcomd errors.

    Solution

    To further improve security, the HACMP Configuration Database (ODM) has the following enhancements:

  • Ownership. All HACMP ODM files are owned by user root and group hacmp. In addition, all HACMP binaries that are intended for use by non-root users are also owned by user root and group hacmp.
  • Permissions. All HACMP ODM files, except for the hacmpdisksubsystem file with 600 permissions, are set with 640 permissions (readable by user root and group hacmp, writable by user root). All HACMP binaries that are intended for use by non-root users are installed with 2555 permissions (readable and executable by all users, with the setgid bit turned on so that the program runs as group hacmp).
  • During the installation, HACMP creates the group “hacmp” on all nodes if it does not already exist. By default, group hacmp has permission to read the HACMP ODMs, but does not have any other special authority. For security reasons, it is recommended not to expand the authority of group hacmp.

    If you use programs that access the HACMP ODMs directly, you may need to rewrite them if they are intended to be run by non-root users:

  • All access to the ODM data by non-root users should be handled via the provided HACMP utilities.
  • In addition, if you are using the PSSP File Collections facility to maintain the consistency of /etc/group, the new group “hacmp” that is created at installation time on the individual cluster nodes may be lost when the next file synchronization occurs.
    There are two possible solutions to this problem. Take one of the following actions before installing HACMP 5.4:

      a. Turn off PSSP File Collections synchronization of /etc/group

    or

      b. Ensure that group “hacmp” is included in the master /etc/group file and ensure that the change is propagated to all cluster nodes.

    Disk and Filesystem Issues

    The following potential disk and filesystem issues are described here:

  • AIX 5L Volume Group Commands Cause System Error Reports
  • Verification Fails on Clusters with Disk Heartbeating Networks
  • varyonvg Command Fails on a Volume Group
  • cl_nfskill Command Fails
  • cl_scdiskreset Command Fails
  • fsck Command Fails at Boot Time
  • System Cannot Mount Specified Filesystems
  • Cluster Disk Replacement Process Fails
  • Automatic Error Notification Fails with Subsystem Device Driver
  • Filesystem Change Not Recognized by Lazy Update.
    AIX 5L Volume Group Commands Cause System Error Reports

    Problem

    The redefinevg, varyonvg, lqueryvg, and syncvg commands fail and report errors against a shared volume group during system restart. These commands send messages to the console when automatically varying on a shared volume group. When configuring the volume groups for the shared disks, autovaryon at boot was not disabled. If a node that is up owns the shared drives, other nodes attempting to vary on the shared volume group will display various varyon error messages.

    Solution

    When configuring the shared volume group, set the Activate volume group AUTOMATICALLY at system restart? field to no on the SMIT System Management (C-SPOC) > HACMP Logical Volume Management > Shared Volume Groups > Create a Shared Volume Group panel. After importing the shared volume group on the other cluster nodes, use the following command to ensure that the volume group on each node is not set to autovaryon at boot:

    chvg -an vgname 
    

    Verification Fails on Clusters with Disk Heartbeating Networks

    Problem 1

    With clusters that have disk heartbeating networks configured, running verification fails with a “PVIDs do not match” error message.

    Solution 1

    Run verification with verbose logging to view messages that indicate where the error occurred (for example, the node, device, or command). The verification utility uses verbose logging to write to the /var/hacmp/clverify/clverify.log file.

    If the hdisks have been renumbered, the disk heartbeating network may no longer be valid. Remove the disk heartbeating network and redefine it.

    Ensure that the disk heartbeating networks are configured on enhanced concurrent volume groups. You can convert an existing volume group to enhanced concurrent mode. For information about converting a volume group, see the chapter Managing Shared LVM Components in a Concurrent Access Environment in the Administration Guide.

    After correcting the problem, select the Verify and Synchronize HACMP Configuration option to synchronize the cluster resources configuration across all nodes. Then select the Start Cluster Services option from the System Management (C-SPOC) > Manage HACMP Services SMIT panel to start the Cluster Manager.

    varyonvg Command Fails on a Volume Group

    Problem 1

    The HACMP software (the /tmp/hacmp.out file) indicates that the varyonvg command failed when trying to vary on a volume group.

    Solution 1

    Ensure that the volume group is not set to autovaryon on any node and that the volume group (unless it is in concurrent access mode) is not already varied on by another node.

    The lsvg -o command can be used to determine whether the shared volume group is active. To check the autovaryon setting, enter:

    lsvg volume_group_name 
    

    on the node that has the volume group activated, and check the AUTO ON field to determine whether the volume group is automatically set to be on. If AUTO ON is set to yes, correct this by entering

    chvg -an volume_group_name 
    

    Problem 2

    The volume group information on disk differs from that in the Device Configuration Data Base.

    Solution 2

    Correct the Device Configuration Data Base on the nodes that have incorrect information:

      1. Use the smit exportvg fastpath to export the volume group information. This step removes the volume group information from the Device Configuration Data Base.
      2. Use the smit importvg fastpath to import the volume group. This step creates a new Device Configuration Data Base entry directly from the information on disk. After importing, be sure to change the volume group to not autovaryon at the next system boot.
      3. Use the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel to issue the clruncmd command to signal the Cluster Manager to resume cluster processing.

    Problem 3

    The HACMP software indicates that the varyonvg command failed because the volume group could not be found.

    Solution 3

    The volume group is not defined to the system. If the volume group has been newly created and exported, or if a mksysb system backup has been restored, you must import the volume group. Follow the steps described in Problem 2 to verify that the correct volume group name is being referenced.

    Problem 4

    The HACMP software indicates that the varyonvg command failed because the logical volume <name> is incomplete.

    Solution 4

    This indicates that the forced varyon attribute is configured for the volume group in SMIT, and that when attempting a forced varyon operation, HACMP did not find a single complete copy of the specified logical volume for this volume group.

    Also, it is possible that you requested a forced varyon operation but did not specify the super strict allocation policy for the mirrored logical volumes. In this case, the success of the varyon command is not guaranteed. For more information on the forced varyon functionality, see the chapter Planning Shared LVM Components in the Planning Guide and the Forcing a Varyon of Volume Groups section in the chapter on Configuring HACMP Resource Groups (Extended) in the Administration Guide.

    cl_nfskill Command Fails

    Problem

    The /tmp/hacmp.out file shows that the cl_nfskill command fails when attempting to perform a forced unmount of an NFS-mounted filesystem. NFS provides a level of filesystem locking that resists a forced unmount by the cl_nfskill command.

    Solution

    Make a copy of the /etc/locks file in a separate directory before executing the cl_nfskill command. Then delete the original /etc/locks file and run the cl_nfskill command. After the command succeeds, re-create the /etc/locks file using the saved copy.
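
    A sketch of that sequence follows; the backup location is arbitrary, and the cl_nfskill invocation itself is left as indicated in /tmp/hacmp.out:

    cp -p /etc/locks /tmp/locks.save      # save a copy in a separate directory
    rm /etc/locks
    # ... run the cl_nfskill command here, as shown in /tmp/hacmp.out ...
    cp -p /tmp/locks.save /etc/locks      # re-create /etc/locks after the command succeeds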

    cl_scdiskreset Command Fails

    Problem

    The cl_scdiskreset command logs error messages to the /tmp/hacmp.out file. To break the reserve held by one system on a SCSI device, the HACMP disk utilities issue the cl_scdiskreset command. The cl_scdiskreset command may fail if back-level hardware exists on the SCSI bus (adapters, cables or devices) or if a SCSI ID conflict exists on the bus.

    Solution

    See the appropriate sections in Chapter 2: Using Cluster Log Files to check the SCSI adapters, cables, and devices. Make sure that you have the latest adapters and cables. The SCSI IDs for each SCSI device must be different.

    fsck Command Fails at Boot Time

    Problem

    At boot time, AIX 5L runs the fsck command to check all the filesystems listed in /etc/filesystems with the check=true attribute. If it cannot check a filesystem, AIX 5L reports the following error:

    Filesystem Helper: 0506-519 Device open failed 
    

    Solution

    For filesystems controlled by HACMP, this message typically does not indicate a problem. The filesystem check fails because the volume group defining the filesystem is not varied on. The boot procedure does not automatically vary on HACMP-controlled volume groups.

    To prevent this message, make sure that all the filesystems under HACMP control do not have the check=true attribute in their /etc/filesystems stanzas. To delete this attribute or change it to check=false, edit the /etc/filesystems file.

    System Cannot Mount Specified Filesystems

    Problem

    The /etc/filesystems file has not been updated to reflect changes to log names for a logical volume. If you change the name of a logical volume after the filesystems have been created for that logical volume, the /etc/filesystems entry for the log does not get updated. Thus when trying to mount the filesystems, the HACMP software tries to get the required information about the logical volume name from the old log name. Because this information has not been updated, the filesystems cannot be mounted.

    Solution

    Be sure to update the /etc/filesystems file after making changes to logical volume names.

    Cluster Disk Replacement Process Fails

    Problem 1

    You cannot complete the disk replacement process due to a node_down event.

    Solution 1

    Once the node is back online, export the volume group, then import it again before starting HACMP on this node.

    Problem 2

    The disk replacement process failed while the replacepv command was running.

    Solution 2

    Delete the /tmp/replacepv directory, and attempt the replacement process again.

    You can also try running the process on another disk.

    Problem 3

    The disk replacement process failed with a “no free disks” message while VPATH devices were available for replacement.

    Solution 3

    Be sure to convert the volume group from VPATH devices to hdisks, and attempt the replacement process again. When the disk is replaced, convert hdisks back to the VPATH devices. For instructions, see the Convert SDD VPATH Device Volume Group to an ESS hdisk Device Volume Group section in the chapter on Managing Shared LVM Components in the Administration Guide.

    Automatic Error Notification Fails with Subsystem Device Driver

    Problem

    You set up automatic error notification for the 2105 IBM Enterprise Storage System (ESS), expecting it to log errors when there is a volume group loss. (The Subsystem Device Driver handles the loss.) However, the error notification fails and you get error messages in the cspoc.log and the smit.log.

    Solution

    If you set up automatic error notification for the 2105 IBM Enterprise Storage System (ESS), which uses the Subsystem Device Driver, all PVIDs must be on VPATHS, or the error notification fails. To avoid this failure, convert all hdisks to VPATH devices.

    Filesystem Change Not Recognized by Lazy Update

    Problem

    If you change the name of a filesystem, or remove a filesystem and then perform a lazy update, lazy update does not run the imfs -lx command before running the imfs command. This may lead to a failure during fallover or prevent a successful restart of the HACMP cluster services.

    Solution

    Use the C-SPOC utility to change or remove filesystems. This ensures that imfs -lx runs before imfs and that the changes are updated on all nodes in the cluster.

    Error Reporting provides detailed information about inconsistency in volume group state across the cluster. If this happens, take manual corrective action. If the filesystem changes are not updated on all nodes, update the nodes manually with this information.

    Network and Switch Issues

    The following potential network and switch issues are described here:

  • Unexpected Network Interface Failure in Switched Networks
  • Cluster Nodes Cannot Communicate
  • Distributed SMIT Causes Unpredictable Results
  • Token-Ring Network Thrashes
  • System Crashes Reconnecting MAU Cables after a Network Failure
  • TMSCSI Will Not Properly Reintegrate when Reconnecting Bus
  • Recovering from PCI Hot Plug NIC Failure
  • Unusual Cluster Events Occur in Non-Switched Environments
  • Cannot Communicate on ATM Classic IP Network
  • Cannot Communicate on ATM LAN Emulation Network
  • IP Label for HACMP Disconnected from AIX 5L Interface
  • TTY Baud Rate Setting Wrong
  • First Node Up Gives Network Error Message in hacmp.out
  • Network Interface Card and Network ODMs Out of Sync with Each Other
  • Non-IP Network, Network Adapter or Node Failures
  • Networking Problems Following HACMP Fallover
  • Packets Lost during Data Transmission
  • Verification Fails when Geo Networks Uninstalled
  • Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks.
    Unexpected Network Interface Failure in Switched Networks

    Problem

    Unexpected network interface failures can occur in HACMP configurations using switched networks if the networks and the switches are incorrectly defined/configured.

    Solution

    Take care to configure your switches and networks correctly. See the section on considerations for switched networks in the Planning Guide for more information.

    Troubleshooting VLANs

    Problem

    Interface failures occur in Virtual LAN networks (from now on referred to as VLAN, Virtual Local Area Network).

    Solution

    To troubleshoot VLAN interfaces defined to HACMP and detect an interface failure, consider these interfaces as interfaces defined on single adapter networks.

    For information on single adapter networks and the use of the netmon.cf file, see Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks.

    In particular, list the network interfaces that belong to a VLAN in the ping_client_list variable in the /usr/es/sbin/cluster/etc/clinfo.rc script and run clinfo. This way, whenever a cluster event occurs, clinfo monitors and detects a failure of the listed network interfaces. Due to the nature of Virtual Local Area Networks, other mechanisms to detect the failure of network interfaces are not effective.

    Cluster Nodes Cannot Communicate

    Problem

    If your configuration has two or more nodes connected by a single network, you may experience a partitioned cluster. A partitioned cluster occurs when cluster nodes cannot communicate. In normal circumstances, a service network interface failure on a node causes the Cluster Manager to recognize and handle a swap_adapter event, where the service IP label/address is replaced with another IP label/address. However, if no other network interface is available, the node becomes isolated from the cluster. Although the Cluster Managers on other nodes are aware of the attempted swap_adapter event, they cannot communicate with the now isolated (partitioned) node because no communication path exists.

    A partitioned cluster can cause GPFS to lose quorum. For more information, see the Appendix on GPFS Cluster Configuration, in the Installation Guide.

    Solution

    Make sure your network is configured for no single point of failure.

    Distributed SMIT Causes Unpredictable Results

    Problem

    Using the AIX 5L utility DSMIT for operations other than starting or stopping HACMP cluster services can cause unpredictable results.

    Solution

    DSMIT manages the operation of networked IBM eServer pSeries processors. It includes the logic necessary to control execution of AIX 5L commands on all networked nodes. Since a conflict with HACMP functionality is possible, use DSMIT only to start and stop HACMP cluster services.

    Token-Ring Network Thrashes

    Problem

    A Token-Ring network cannot reach steady state unless all stations are configured for the same ring speed. One symptom of the adapters being configured at different speeds is a clicking sound heard at the MAU (multi-station access unit).

    Solution

    Configure all adapters for either 4 or 16 Mbps.

    System Crashes Reconnecting MAU Cables after a Network Failure

    Problem

    A global network failure occurs and crashes all nodes in a four-node cluster after reconnecting MAUs (multi-station access unit). More specifically, if the cables that connect multiple MAUs are disconnected and then reconnected, all cluster nodes begin to crash.

    This result happens in a configuration where three nodes are attached to one MAU (MAU1) and a fourth node is attached to a second MAU (MAU2). Both MAUs (1 and 2) are connected together to complete a Token-Ring network. If MAU1 is disconnected from the network, all cluster nodes can continue to communicate; however, if MAU2 is disconnected, node isolation occurs.

    Solution

    To avoid causing the cluster to become unstable, do not disconnect cables connecting multiple MAUs in a Token-Ring configuration.

    TMSCSI Will Not Properly Reintegrate when Reconnecting Bus

    Problem

    If the SCSI bus is broken while running as a target mode SCSI network, the network will not properly reintegrate when reconnecting the bus.

    Solution

    The HACMP software may need to be restarted on all nodes attached to that SCSI bus. When target mode SCSI is enabled and the cfgmgr command is run on a particular machine, it will go out on the bus and create a target mode initiator for every node that is on the SCSI network. In a four-node cluster, when all four nodes are using the same SCSI bus, each machine will have three initiator devices (one for each of the other nodes).

    In this configuration, use a maximum of four target mode SCSI networks. You would therefore use networks between nodes A and B, B and C, C and D, and D and A.

    Target mode SCSI devices are not always properly configured during the AIX 5L boot process. Ensure that all the tmscsi initiator devices are available on all nodes before bringing up the cluster. To do this run lsdev -Cc tmscsi, which returns:

    tmscsix  Available 00-12-00-40 SCSI I/O Controller Initiator Device 
    

    where x identifies the particular tmscsi device. If the status is not “Available,” run the cfgmgr command and check again.

    Recovering from PCI Hot Plug NIC Failure

    Problem

    If an unrecoverable error causes a PCI hot-replacement process to fail, the NIC may be left in an unconfigured state and the node may be left in maintenance mode. The PCI slot holding the NIC and/or the new NIC may be damaged at this point.

    Solution

    User intervention is required to get the node back in fully working order. For more information, refer to the AIX Managing Hot Plug Connectors from System Management Guide: Operating System and Devices.

    Unusual Cluster Events Occur in Non-Switched Environments

    Problem

    Some network topologies may not support the use of simple switches. In these cases, you should expect that certain events may occur for no apparent reason. These events may be:

  • Cluster unable to form, either all or some of the time
  • swap_adapter pairs
  • swap_adapter, immediately followed by a join_standby
  • fail_standby and join_standby pairs.
    These events occur when ARP packets are delayed or dropped. This is correct and expected HACMP behavior, as HACMP is designed to depend on core protocols strictly adhering to their related RFCs.

    For a review of basic HACMP network requirements, see the Planning Guide.

    Solution

    The following implementations may reduce or circumvent these events:

  • Increase the Failure Detection Rate (FDR) to exceed the ARP retransmit time of 15 seconds, where typical values have been calculated as follows:
    FDR = (2+ * 15 seconds) + >5 = 35+ seconds (usually 45-60 seconds)

    “2+” is a number greater than one in order to allow multiple ARP requests to be generated. This is required so that at least one ARP response will be generated and received before the FDR time expires and the network interface is temporarily marked down, then immediately marked back up.

    Keep in mind, however, that the “true” fallover is delayed for the value of the FDR.

  • Increase the ARP queue depth.
    If you increase the queue, requests that are dropped or delayed will be masked until network congestion or network quiescence (inactivity) makes this problem evident.

  • Use a dedicated switch, with all protocol optimizations turned off. Segregate it into a physical LAN segment and bridge it back into the enterprise network.
  • Use permanent ARP entries (IP address to MAC address bindings) for all network interfaces. These values should be set at boot time, and since none of the ROM MAC addresses are used, replacing network interface cards will be invisible to HACMP.
    Note: The above four items simply describe how some customers have customized their unique enterprise network topology to provide the classic protocol environment (strict adherence to RFCs) that HACMP requires. IBM cannot guarantee HACMP will work as expected in these approaches, since none addresses the root cause of the problem. If your network topology requires consideration of any of these approaches, please contact the IBM Consult Line for assistance.

    Cannot Communicate on ATM Classic IP Network

    Problem

    If you cannot communicate successfully with a cluster network interface of type atm (a cluster network interface configured over a Classic IP client), check the following:

    Solution

      1. Check the client configuration. Check that the 20-Byte ATM address of the Classic IP server that is specified in the client configuration is correct, and that the interface is configured as a Classic IP client (svc-c) and not as a Classic IP server (svc-s).
      2. Check that the ATM TCP/IP layer is functional. Check that the UNI version settings that are configured for the underlying ATM device and for the switch port to which this device is connected are compatible. It is recommended not to use the value auto_detect for either side.
    If the connection between the ATM device# and the switch is not functional on the ATM protocol layer, this can also be due to a hardware failure (NIC, cable, or switch).
    Use the arp command to verify this:
    [bass][/]>arp -t atm -a
    SVC - at0 on device atm1 -
    ==========================
    at0(10.50.111.6) 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.98.fc.0
    IP Addr VPI:VCI Handle ATM Address
    server_10_50_111(10.50.111.255) 0:888 15 39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.11.0
    SVC - at1 on device atm0 -
    ==========================
    at1(10.50.120.6) 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.99.c1.1
    IP Addr VPI:VCI Handle ATM Address
    ?(0.0.0.0) N/A N/A 15 39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.20.0
    SVC - at3 on device atm2 -
    ==========================
    at3(10.50.110.6) 8.0.5a.99.00.c1.0.0.0.0.0.0.0.0.0.0.0.0.0.0
    IP Addr VPI:VCI Handle ATM Address
    ?(0.0.0.0) 0:608 16 39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.10.0

    In the example above, the client at0 is operational. It has registered with its server, server_10_50_111.

    The client at1 is not operational, since it could not resolve the address of its Classic IP server, which has the hardware address 39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.11.0. However, the ATM layer is functional, since the 20-byte ATM address that has been constructed for the client at1 is correct. The first 13 bytes are the switch address, 39.99.99.99.99.99.99.0.0.99.99.1.1.

    For client at3, the connection between the underlying device atm2 and the ATM switch is not functional, as indicated by the failure to construct the 20 Byte ATM address of at3. The first 13 bytes do not correspond to the switch address, but contain the MAC address of the ATM device corresponding to atm2 instead.

    Cannot Communicate on ATM LAN Emulation Network

    Problem

    You are having problems communicating with an ATM LANE client.

    Solution

    Check that the LANE client is registered correctly with its configured LAN Emulation server. A failure of a LANE client to connect with its LAN Emulation server can be due to the configuration of the LAN Emulation server functions on the switch. There are many possible reasons.

      1. Correct client configuration: Check that the 20 Byte ATM address of the LAN Emulation server, the assignment to a particular ELAN, and the Maximum Frame Size value are all correct.
      2. Correct ATM TCP/IP layer: Check that the UNI version settings that are configured for the underlying ATM device and for the switch port to which this device is connected are compatible. It is recommended not to use the value auto_detect for either side.
    If the connection between the ATM device# and the switch is not functional on the ATM protocol layer, this can also be due to a hardware failure (NIC, cable, or switch).
    Use the entstat and tokstat commands to determine the state of ATM LANE clients.
    [bass][/]> entstat -d ent3

    The output will contain the following:

    General Statistics:
    -------------------
    No mbuf Errors: 0
    Adapter Reset Count: 3
    Driver Flags: Up Broadcast Running
    Simplex AlternateAddress
    ATM LAN Emulation Specific Statistics:
    --------------------------------------
    Emulated LAN Name: ETHER3
    Local ATM Device Name: atm1
    Local LAN MAC Address:
    42.0c.01.03.00.00
    Local ATM Address:
    39.99.99.99.99.99.99.00.00.99.99.01.01.08.00.5a.99.98.fc.04
    Auto Config With LECS:
    No
    LECS ATM Address:
    00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00
    LES ATM Address:
    39.99.99.99.99.99.99.00.00.99.99.01.01.88.88.88.88.00.03.00

    In the example above, the client is operational as indicated by the Running flag.

    If the client had failed to register with its configured LAN Emulation Server, the Running flag would not appear; instead, the Limbo flag would be set.

    If the connection of the underlying device atm# was not functional on the ATM layer, then the local ATM address would not contain as the first 13 Bytes the Address of the ATM switch.

      3. Switch-specific configuration limitations: Some ATM switches do not allow more than one client belonging to the same ELAN and configured over the same ATM device to register with the LAN Emulation Server at the same time. If this limitation holds and two clients are configured, the following are typical symptoms.
  • Cyclic occurrence of events indicating network interface failures, such as fail_standby, join_standby, and swap_adapter
    This is a typical symptom if two such clients are configured as cluster network interfaces. The client which first succeeds in registering with the LES will hold the connection for a specified, configuration-dependent duration. After it times out, the other client succeeds in establishing a connection with the server; hence the cluster network interface configured on it will be detected as alive, and the former as down.

  • Sporadic events indicating an network interface failure (fail_standby, join_standby, and swap_adapter)
    If one client is configured as a cluster network interface and the other outside, this configuration error may go unnoticed if the client on which the cluster network interface is configured manages to register with the switch, and the other client remains inactive. The second client may succeed at registering with the server at a later moment, and a failure would be detected for the cluster network interface configured over the first client.

    IP Label for HACMP Disconnected from AIX 5L Interface

    Problem

    When you define network interfaces to the cluster configuration by entering or selecting an HACMP IP label, HACMP discovers the associated AIX 5L network interface name. HACMP expects this relationship to remain unchanged. If you change the name of the AIX 5L network interface name after configuring and synchronizing the cluster, HACMP will not function correctly.

    Solution

    If this problem occurs, you can reset the network interface name from the SMIT HACMP System Management (C-SPOC) panel. For more information, see the chapter on Managing the Cluster Resources in the Administration Guide.

    TTY Baud Rate Setting Wrong

    Problem

    The default baud rate is 38400. Some modems or devices cannot operate at 38400. If this is the case for your situation, you can change the default by customizing the RS232 network module to use the desired baud rate (9600, 19200, or 38400).

    Solution

    Change the Custom Tuning Parameters for the RS232 network module. For instructions, see the chapter on Managing the Cluster Topology in the Administration Guide.

    First Node Up Gives Network Error Message in hacmp.out

    Problem

    The first node up in an HACMP cluster gives the following network error message in /tmp/hacmp.out, even if the network is healthy:

    Error: EVENT START: network_down -1 ELLA 
    

    Whether the network is functional or not, the RSCT topology services heartbeat interval expires, resulting in the logging of the above error message. This message is only relevant to non-IP networks (such as RS232, TMSCSI, TMSSA). This behavior does not occur for disk heartbeating networks (for which network_down events are not logged in general).

    Solution

    Ignore the message and let the cluster services continue to function. You should see this error message corrected in a healthy cluster as functional network communication is eventually established between other nodes in the cluster. A network_up event will be run after the second node that has an interface on this network joins the cluster. If cluster communication is not established after this error message, then the problem should be diagnosed in other sections of this guide that discuss network issues.

    Network Interface Card and Network ODMs Out of Sync with Each Other

    Problem

    In some situations, it is possible for the HACMPadapter or the HACMPnetwork ODMs to become out of sync with the AIX 5L ODMs. For example, HACMP may refer to an Ethernet network interface card while AIX 5L refers to a Token-Ring network interface card.

    Note: This type of out-of-sync condition only occurs as a result of the following situations:

  • If the hardware settings have been adjusted after the HACMP cluster has been successfully configured and synchronized, or
  • If the wrong values were selected when configuring predefined communication interfaces to HACMP.

    Solution

    Run cluster verification to detect and report the following network and network interface card type incompatibilities:

  • The network interface card configured in HACMP is the correct one for the node’s hardware
  • The network interface cards configured in HACMP and AIX 5L match each other.
    If verification returns an error, examine and adjust the selections made on the Extended Configuration > Extended Topology Configuration > Configuring HACMP Communication Interfaces/Devices > Change/Show Communication Interfaces/Devices SMIT panel. For more information on this screen, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) of the Administration Guide.

    Non-IP Network, Network Adapter or Node Failures

    Problem

    The non-IP interface declares its neighbor down after the Failure Detection Rate has expired for that network interface type. HACMP waits the same interval again before declaring the local interface down (if no heartbeat is received from the neighbor).

    Solution

    The non-IP heartbeating helps determine the difference between a NIC failure, network failure, and even more importantly node failure. When a non-IP network failure occurs, HACMP detects a non-IP network down and logs an error message in the /tmp/hacmp.out file.

    Use the clstat -s command to display the service IP labels for non-IP networks that are currently down on a network.

    The RSCT topsvcs daemon logs messages whenever an interface changes state. These errors are visible in the errpt.

    For more information, see the section Changing the Configuration of a Network Module in the chapter on Managing the Cluster Topology in the Administration Guide.

    Networking Problems Following HACMP Fallover

    Problem

    If you are using Hardware Address Takeover (HWAT) with any gigabit Ethernet adapters supporting flow control, you may be exposed to networking problems following an HACMP fallover. If a system crash occurs on one node and power still exists for the adapters on the crashed node, even though the takeover is successful, the network connection to the takeover node may be lost or the network containing the failing adapters may lock up. The problem is related to flow control being active on the gigabit Ethernet adapter in conjunction with how the Ethernet switch handles this situation.

    Solution

    Turn off flow control on the gigabit Ethernet adapters.

    To disable flow control on the adapter type:

    ifconfig entX detach  
        # where entX corresponds to the gigabit adapter device 
    chdev -l entX -a flow_ctrl=no 
    

    Then reconfigure the network on that adapter.

    Packets Lost during Data Transmission

    Problem

    If data is intermittently lost during transmission, it is possible that the maximum transmission unit (MTU) has been set to different sizes on different nodes. For example, if Node A sends 8 K packets to Node B, which can accept 1.5 K packets, Node B assumes the message is complete; however data may have been lost.

    Solution

    Run the cluster verification utility to ensure that all of the network interface cards on all cluster nodes on the same network have the same MTU size setting. If the MTU size is inconsistent across the network, an error is displayed, which enables you to determine which nodes to adjust.

    Note: You can change an MTU size by using the following command:
    chdev -l en0 -a mtu=<new_value_from_1_to_8>
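
    To compare the current setting on each node before making a change, you can query the interface attribute, for example:

    lsattr -El en0 -a mtu      # interface name is an example; repeat on every node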

    Verification Fails when Geo Networks Uninstalled

    Problem

    HAGEO uninstalled, but Geo network definitions remain and cluster verification fails.

    Solution

    After HAGEO is uninstalled, any HACMP networks which are still defined as type Geo_Primary or Geo_Secondary must either be removed, or their type must be modified to correspond to the network type (such as Ethernet, Token Ring, RS232). HACMP verification will fail unless these changes are made to the HACMP network definitions.

    Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks

    Problem

    Missing entries in the /etc/hosts for the netmon.cf file may prevent your networks from being properly monitored by the netmon utility of the RSCT Topology Services.

    Solution

    Make sure that for each entry in the netmon.cf file, the /etc/hosts file includes the IP address and its corresponding label. If the entries are missing, the NIM process of RSCT may be blocked while RSCT attempts to determine the state of the local adapters.

    In general, we recommend creating the netmon.cf file for cluster configurations that include networks that can, under certain conditions, become single adapter networks. In such networks, it can be difficult for HACMP to accurately determine adapter failure, because RSCT Topology Services cannot force packet traffic over the single adapter to verify its operation. Creating the netmon.cf file allows RSCT to accurately determine adapter failure.
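
    As a sketch, each target listed in netmon.cf should have a matching /etc/hosts entry; the path, addresses, and labels below are examples only:

    # /usr/es/sbin/cluster/netmon.cf  -- one target per line
    192.168.10.254
    gateway_a

    # corresponding /etc/hosts entries
    192.168.10.254   netmon_target1
    192.168.10.253   gateway_a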

    For more information on creating the netmon.cf file, see the Planning Guide.

    Cluster Communications Issues

    The following potential cluster communications issues are described here:

  • Message Encryption Fails
  • Cluster Nodes Do Not Communicate with Each Other.
    Message Encryption Fails

    Problem

    You have message authentication, or message authentication and encryption, enabled, and you receive a message that encryption failed or that a message could not be decrypted.

    Solution

    If the encryption filesets are not found on the local node, a message indicates that the encryption libraries were not found.

    If you did not receive a message that encryption libraries could not be found on the local node, check the clcomd.log file to determine if the encryption filesets are not found on a remote node.

    Verify whether the cluster node has the following filesets installed:

  • For data encryption with DES message authentication: rsct.crypt.des
  • For data encryption with Triple DES message authentication: rsct.crypt.3des
  • For data encryption with Advanced Encryption Standard (AES) message authentication: rsct.crypt.aes256.
    If needed, install these filesets from the AIX 5L Expansion Pack CD-ROM.
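
    For example, to check which of these filesets are installed on a node:

    lslpp -l "rsct.crypt.*"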

    If the filesets are installed after HACMP is already running, start and stop the HACMP Cluster Communications daemon to enable HACMP to use these filesets. To restart the Cluster Communications daemon:

    stopsrc -s clcomdES
    startsrc -s clcomdES

    If the filesets are present, and you get an encryption error, the encryption filesets may have been installed, or reinstalled, after HACMP was running. In this case, restart the Cluster Communications daemon as described above.

    Cluster Nodes Do Not Communicate with Each Other

    Problem

    Cluster nodes are unable to communicate with each other, and you have one of the following configured:

  • Message authentication, or message authentication and encryption enabled
  • Use of persistent IP labels for VPN tunnels.
    Solution

    Make sure that the network is operational, see the section Network and Switch Issues.

    Check if the cluster has persistent IP labels. If it does, make sure that they are configured correctly and that you can ping the IP label.

    If you are using message authentication, or message authentication and encryption:

  • Make sure that each cluster node has the same setting for message authentication mode. If the modes are different, on each node set message authentication mode to None and configure message authentication again.
  • Make sure that each node has the same type of encryption key in the /usr/es/sbin/cluster/etc directory. Encryption keys cannot reside in other directories.
    If you have configured use of persistent IP labels for a VPN:

      1. Change User Persistent Labels to No.
      2. Synchronize cluster configuration.
      3. Change User Persistent Labels to Yes.

    HACMP Takeover Issues

    Note that if you are investigating resource group movement in HACMP (for instance, why an rg_move event has occurred), always check the /tmp/hacmp.out file. Given the changes in the way resource groups are handled and prioritized in fallover circumstances, the hacmp.out file and its event summaries have become even more important in tracking the activity and resulting location of your resource groups. In addition, with parallel processing of resource groups, the hacmp.out file reports details that cannot be seen in the cluster history log or the clstrmgr.debug log file. Always check the hacmp.out log early on when investigating resource group movement after takeover activity.

    The following potential takeover issues are described here:

  • varyonvg Command Fails during Takeover
  • Highly Available Applications Fail
  • Node Failure Detection Takes Too Long
  • HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX 5L
  • Group Services Sends GS_DOM_MERGE_ER Message
  • cfgmgr Command Causes Unwanted Behavior in Cluster
  • Releasing Large Amounts of TCP Traffic Causes DMS Timeout
  • Deadman Switch Causes a Node Failure
  • Deadman Switch Time to Trigger
  • A “device busy” Message Appears after node_up_local Fails
  • Network Interfaces Swap Fails Due to an rmdev “device busy” Error
  • MAC Address Is Not Communicated to the Ethernet Switch.
    varyonvg Command Fails during Takeover

    Problem

    The HACMP software failed to vary on a shared volume group. The volume group name is either missing or is incorrect in the HACMP Configuration Database object class.

    Solution

  • Check the /tmp/hacmp.out file to find the error associated with the varyonvg failure.
  • List all the volume groups known to the system using the lsvg command; then check that the volume group names used in the HACMPresource Configuration Database object class are correct. To change a volume group name in the Configuration Database, from the main HACMP SMIT panel select Initialization and Standard Configuration > Configure HACMP Resource Groups > Change/Show Resource Groups, and select the resource group where you want the volume group to be included. Use the Volume Groups or Concurrent Volume Groups fields on the Change/Show Resources and Attributes for a Resource Group panel to set the volume group names. After you correct the problem, use the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel to issue the clruncmd command to signal the Cluster Manager to resume cluster processing.
  • Run the cluster verification utility to verify cluster resources.
    Highly Available Applications Fail

    Problem 1

    Highly available applications fail to start on a fallover node after an IP address takeover. The hostname may not be set.

    Solution 1

    Some software applications require an exact hostname match before they start. If your HACMP environment uses IP address takeover and starts any of these applications, add the following lines to the script you use to start the application servers:

    mkdev -t inet 
    chdev -l inet0 -a hostname=nnn 
    

    where nnn is the hostname of the machine the fallover node is masquerading as.

    Problem 2

    An application that a user has manually stopped following a stop of cluster services where resource groups were placed in an UNMANAGED state, does not restart with reintegration of the node.

    Solution 2

    Check that the relevant application entry in the /usr/es/sbin/cluster/server.status file has been removed prior to node reintegration.

    Since the /usr/es/sbin/cluster/server.status file lists the applications already running on the node, HACMP will not restart applications that have entries in the server.status file.

    Deleting the relevant application entry from server.status before reintegration allows HACMP to recognize that the highly available application is not running, and that it must be restarted on the node.
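
    The following sketch inspects the file and removes the entry for a hypothetical application server named app_server1; the exact file format is not shown here, so adapt the filter to what you see in the file:

    cat /usr/es/sbin/cluster/server.status
    grep -v "app_server1" /usr/es/sbin/cluster/server.status > /tmp/server.status.new
    mv /tmp/server.status.new /usr/es/sbin/cluster/server.status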

    Node Failure Detection Takes Too Long

    Problem

    The Cluster Manager fails to recognize a node failure in a cluster configured with a Token-Ring network. The Token-Ring network cannot become stable after a node failure unless the Cluster Manager allows extra time for failure detection.

    In general, a buffer time of 14 seconds is used before determining failures on a Token-Ring network. This means that all Cluster Manager failure modes will take an extra 14 seconds if the Cluster Manager is dealing with Token-Ring networks. This time, however, does not matter if the Cluster Manager is using both Token-Ring and Ethernet. If Cluster Manager traffic is using a Token-Ring network interface, the 14 extra seconds for failures applies.

    Solution

    If the extra time is not acceptable, you can switch to an alternative network, such as an Ethernet. Using a non-IP heartbeating network (such as RS232) as recommended for all clusters should prevent this problem.

    For some configurations, it is possible to run all the cluster network traffic on a separate network (Ethernet), even though a Token-Ring network also exists in the cluster. When you configure the cluster, include only the interfaces used on this separate network. Do not include the Token-Ring interfaces.

    Since the Cluster Manager has no knowledge of the Token-Ring network, the 14-second buffer does not apply; thus failure detection occurs faster. Since the Cluster Manager does not know about the Token-Ring network interfaces, it cannot monitor them, nor can it swap network interfaces if one of the network interfaces fails or if the cables are unplugged.

    HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX 5L

    Problem

    HACMP fails to selectively move the affected resource group to another cluster node when a volume group quorum loss occurs.

    Solution

    If quorum is lost for a volume group that belongs to a resource group on a cluster node, the system checks whether the LVM_SA_QUORCLOSE error appeared in the node’s AIX 5L error log file and informs the Cluster Manager to selectively move the affected resource group. HACMP uses this error notification method only for mirrored volume groups with quorum enabled.

    If fallover does not occur, check that the LVM_SA_QUORCLOSE error appeared in the AIX 5L error log. When the AIX 5L error log buffer is full, new entries are discarded until buffer space becomes available and an error log entry informs you of this problem. To resolve this issue, increase the size of the AIX 5L error log internal buffer for the device driver. For information about increasing the size of the error log buffer, see the AIX 5L documentation listed in About This Guide.

    Group Services Sends GS_DOM_MERGE_ER Message

    Problem

    A Group Services merge message is displayed and the node receiving the message shuts itself down. You see a GS_DOM_MERGE_ER error log entry, as well as a message in the Group Services daemon log file:

    “A better domain XXX has been discovered, or domain master requested to dissolve the domain.”

    A Group Services merge message is sent when a node loses communication with the cluster and then tries to reestablish communication.

    Solution

    Because it may be difficult to determine the state of the missing node and its resources (and to avoid a possible data divergence if the node rejoins the cluster), you should shut down the node and successfully complete the takeover of its resources.

    For example, if a cluster node becomes unable to communicate with other nodes, yet it continues to work through its process table, the other nodes conclude that the “missing” node has failed because they no longer are receiving keepalive messages from the “missing” node. The remaining nodes then process the necessary events to acquire the disks, IP addresses, and other resources from the “missing” node. This attempt to take over resources results in the dual-attached disks receiving resets to release them from the “missing” node and to start IP address takeover scripts.

    As the disks are being acquired by the takeover node (or after the disks have been acquired and applications are running), the “missing” node completes its process table (or clears an application problem) and attempts to resend keepalive messages and rejoin the cluster. Since the disks and IP address have been successfully taken over, it becomes possible to have a duplicate IP address on the network and the disks may start to experience extraneous traffic on the data bus.

    Because the reason for the “missing” node remains undetermined, you can assume that the problem may repeat itself later, causing additional downtime of not only the node but also the cluster and its applications. Thus, to ensure the highest cluster availability, GS merge messages should be sent to any “missing” cluster node to identify node isolation, to permit the successful takeover of resources, and to eliminate the possibility of data corruption that can occur if both the takeover node and the rejoining “missing” node attempt to write to the disks. Also, if two nodes exist on the network with the same IP address, transactions may be missed and applications may hang.

    When you have a partitioned cluster, the node(s) on each side of the partition detect this and run a node_down for the node(s) on the opposite side of the partition. If, while this is running or after communication is restored, the two sides of the partition do not agree on which nodes are still members of the cluster, a decision is made as to which partition should remain up; the other partition is shut down by a GS merge from nodes in the surviving partition or by a node sending a GS merge to itself.

    In clusters consisting of more than two nodes, the decision is based on which partition has the most nodes left in it, and that partition stays up. With an equal number of nodes in each partition (as is always the case in a two-node cluster), the node(s) that remain up are determined by node number: the lowest node number in the cluster remains, which is also generally the first in alphabetical order.

    Group Services domain merge messages indicate that a node isolation problem was handled to keep the resources as highly available as possible, giving you time to later investigate the problem and its cause. When a domain merge occurs, Group Services and the Cluster Manager exit. The clstrmgr.debug file will contain the following error:

    "announcementCb: GRPSVCS announcement code=n; exiting"  
    "CHECK FOR FAILURE OF RSCT SUBSYSTEMS (topsvcs or grpsvcs)" 
    

    cfgmgr Command Causes Unwanted Behavior in Cluster

    Problem

    SMIT commands like Configure Devices Added After IPL use the cfgmgr command. Sometimes this command can cause unwanted behavior in a cluster. For instance, if there has been a network interface swap, the cfgmgr command tries to reswap the network interfaces, causing the Cluster Manager to fail.

    Solution

    See the Installation Guide for information about modifying rc.net, thereby bypassing the issue. You can use this technique at all times, not just for IP address takeover, but it adds to the overall takeover time, so it is not recommended.

    Releasing Large Amounts of TCP Traffic Causes DMS Timeout

    Large amounts of TCP traffic over an HACMP-controlled service interface may cause AIX 5L to experience problems when queuing and later releasing this traffic. When traffic is released, it generates a large CPU load on the system and prevents timing-critical threads from running, thus causing the Cluster Manager to issue a deadman switch (DMS) timeout.

    To reduce performance problems caused by releasing large amounts of TCP traffic into a cluster environment, consider increasing the Failure Detection Rate beyond Slow to a time that can handle the additional delay before a takeover. See the Changing the Failure Detection Rate of a Network Module section in the chapter on Managing the Cluster Topology in the Administration Guide.

    Also, to lessen the probability of a DMS timeout, complete the following steps before issuing a node_down:

      1. Use the netstat command to identify the ports using an HACMP-controlled service network interface.
      2. Use the ps command to identify all remote processes logged to those ports.
      3. Use the kill command to terminate these processes.
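
    For example, a minimal sketch of these steps, assuming a hypothetical service address of 192.168.10.1 and remote login sessions as the source of the traffic; adjust the address and the process patterns to your own configuration:

    # 1. Identify connections that use the HACMP-controlled service address
    netstat -an | grep 192.168.10.1
    # 2. Identify the remote processes attached to those connections
    ps -ef | egrep 'telnetd|rlogind'     # adjust the pattern to your workload
    # 3. Terminate those processes before issuing the node_down
    # kill <pid>                         # use kill -9 only if a process ignores SIGTERM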

    Deadman Switch Causes a Node Failure

    Problem

    The node experienced an extreme performance problem, such as a large I/O transfer, excessive error logging, or running out of memory, and the Topology Services daemon (hatsd) is starved for CPU time. It could not reset the deadman switch within the time allotted. Misbehaved applications running at a priority higher than the Cluster Manager can also cause this problem.

    Solutions

    The deadman switch describes the AIX 5L kernel extension that causes a system panic and dump under certain cluster conditions if it is not reset. The deadman switch halts a node when it enters a hung state that extends beyond a certain time limit. This enables another node in the cluster to acquire the hung node’s resources in an orderly fashion, avoiding possible contention problems. Solutions related to performance problems should be performed in the following order:

      1. Tune the system using I/O pacing and increasing the syncd frequency as directed in the chapter on Configuring AIX 5L for HACMP in the Installation Guide.
      2. If needed, increase the amount of memory available for the communications subsystem.
      3. Tune virtual memory management (VMM). This is explained below.
      4. Change the Failure Detection Rate. For more information, see the Changing the Failure Detection Rate of a Network Module section in the chapter on Managing the Cluster Topology in the Administration Guide.
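
    As an illustration of the first two steps only (the values shown are examples, not recommendations; follow the Installation Guide for the settings appropriate to your systems):

    # enable I/O pacing by setting example high/low water marks on sys0
    chdev -l sys0 -a maxpout=33 -a minpout=24
    # the syncd frequency is the interval argument on the line in /sbin/rc.boot
    # that starts /usr/sbin/syncd; lowering it (for example from 60 to 10 seconds)
    # requires editing that line and rebooting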

    Tuning Virtual Memory Management

    For most customers, it is necessary to increase minfree/maxfree whenever the freelist drops more than 10 times the number of memory pools below minfree, to allow the system to maintain consistent response times. To determine the current size of the freelist, use the vmstat command; the freelist size is the value labeled free. The number of memory pools in a system is the larger of the number of CPUs divided by 8 or the memory size in GB divided by 16, but never more than the number of CPUs and always at least one. The value of minfree is shown by the vmtune command.

    In systems with multiple memory pools, it may also be important to increase minfree/maxfree even though minfree will not show as 120, since the default minfree is 120 times the number of memory pools. If you raise minfree/maxfree, do so with care: setting the values too high keeps too many pages on the freelist for no real reason. One suggestion is to increase minfree and maxfree by 10 times the number of memory pools, then observe the freelist again. In specific application environments, such as multiple processes (three or more) each reading or writing a very large sequential file (at least 1 GB each), it may be best to set minfree relatively high, for example 120 times the number of CPUs, so that maximum throughput can be achieved.

    This suggestion is specific to a multi-process, large sequential access environment. In such high sequential I/O environments, maxfree should also be set more than just 8 times the number of CPUs higher than minfree, for example maxfree = minfree + (maxpgahead x the number of CPUs), where minfree has already been determined using the formula above. The default for maxpgahead is 8, but in many high sequential activity environments, best performance is achieved with maxpgahead set to 32 or 64. This suggestion applies to all pSeries models still being marketed, regardless of memory size. Without these changes, the chances of a DMS timeout can be high in these specific environments, especially those with minimum memory size.
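
    As a purely illustrative worked example of this formula (the CPU count and maxpgahead value below are hypothetical):

    # hypothetical system: 8 CPUs, maxpgahead tuned to 32, minfree = 120 x CPUs
    NCPUS=8
    MAXPGAHEAD=32
    MINFREE=$(( 120 * NCPUS ))                   # 960
    MAXFREE=$(( MINFREE + MAXPGAHEAD * NCPUS ))  # 960 + 256 = 1216
    echo "minfree=$MINFREE maxfree=$MAXFREE"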

    For database environments, these suggestions should be modified. If JFS files are being used for database tables, then watching minfree still applies, but maxfree could be just minfree + (8 x the number of memory pools). If raw logical volumes are being used, the concerns about minfree/maxfree do not apply, but the following suggestion about maxperm is relevant.

    In any environment (HA or otherwise) that is seeing non-zero paging rates, it is recommended that maxperm be set lower than the default of ~80%. Use the avm column of vmstat as an estimate of the number of working storage pages (valid memory pages) in use; observe it at full load and compare it against the system’s real memory size, as shown by vmtune, to determine the percentage of real memory occupied by working storage pages. For example, if avm shows as 70% of real memory size, then maxperm should be set to 25% (vmtune -P 25). The basic formula used here is maxperm = 95 - (avm as a percentage of memory size in pages). If avm is at or above roughly 90% of memory, the formula gives a maxperm of 5% or less and the system is memory constrained. The options at this point are to set maxperm to 5% and incur some paging activity, add memory to the system, or reduce the total workload run simultaneously on the system so that avm is lowered.
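
    A minimal sketch of the maxperm calculation, assuming hypothetical values for avm (from vmstat) and for real memory size in 4 KB pages:

    # hypothetical values: avm = 1,400,000 pages, real memory = 2,000,000 pages
    AVM_PAGES=1400000
    MEM_PAGES=2000000
    AVM_PCT=$(( AVM_PAGES * 100 / MEM_PAGES ))   # 70 percent of real memory
    MAXPERM=$(( 95 - AVM_PCT ))                  # 95 - 70 = 25
    echo "suggested maxperm = ${MAXPERM}%"       # apply with: vmtune -P 25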

    Deadman Switch Time to Trigger

    The Topology Services chapter in the Parallel System Support Programs for AIX Diagnosis Guide has several hints about how to avoid having the hatsd blocked which causes the deadman switch (DMS) to hit. The relevant information is in the Diagnostic Procedure section of the chapter. See “Action 5 - Investigate hatsd problem” and “Action 8 - Investigate node crash”. The URL for this Guide follows:

    http://publibfp.boulder.ibm.com/epubs/pdf/a2273503.pdf

    Running the /usr/sbin/rsct/bin/hatsdmsinfo command

    This command is useful for checking on the deadman switch trigger time.

    Output of the /usr/sbin/rsct/bin/hatsdmsinfo command looks like this:

    ======================================================== 
    Information for Topology Services -- HACMP /ES 
    DMS Trigger time: 20.000 seconds. 
    Last DMS Resets                          Time to Trigger (seconds) 
    06/04/02 06:51:53.064                    19.500 
    06/04/02 06:51:53.565                    19.499 
    06/04/02 06:51:54.065                    19.500 
    06/04/02 06:51:54.565                    19.500 
    06/04/02 06:51:55.066                    19.500 
    06/04/02 06:51:55.566                    19.499 
    DMS Resets with small time-to-trigger    Time to Trigger (seconds) 
    Threshold value: 15.000 seconds. 
    

    A “device busy” Message Appears after node_up_local Fails

    Problem

    A device busy message in the /tmp/hacmp.out file appears when swapping hardware addresses between the boot and service address. Another process is keeping the device open.

    Solution

    Check to see if sysinfod, the SMUX peer daemon, or another process is keeping the device open. If it is sysinfod, restart it using the -H option.

    Network Interfaces Swap Fails Due to an rmdev “device busy” Error

    Problem

    Network interfaces swap fails due to an rmdev device busy error. For example, /tmp/hacmp.out shows a message similar to the following:

    Method error (/etc/methods/ucfgdevice): 
    0514-062 Cannot perform the requested function because the specified 
    device is busy. 
    

    Solution

    Check to see whether the following applications are being run on the system. These applications may keep the device busy:

  • SNA
    Use the following command to see if SNA is running:
    lssrc -g sna

    Use the following command to stop SNA:

    stopsrc -g sna

    If that does not work, use the following command:

    stopsrc -f -s sna

    If that does not work, use the following command:

    /usr/bin/sna -stop sna -t forced

    If that does not work, use the following command:

    /usr/bin/sna -stop sna -t cancel
  • Netview / Netmon
    Ensure that the sysmond daemon has been started with the -H flag. This causes the network interface to be opened and closed each time SM/6000 reads the status, and allows the cl_swap_HW_address script to succeed when it executes the rmdev command (after the ifconfig detach) before swapping the hardware address.

    Use the following command to stop all Netview daemons:

    /usr/OV/bin/nv6000_smit stopdaemons
  • IPX
    Use the following commands to see if IPX is running:
    ps -ef |grep npsd
    ps -ef |grep sapd

    Use the following command to stop IPX:

    /usr/lpp/netware/bin/stopnps
  • NetBIOS.
    Use the following command to see if NetBIOS is running:
    ps -ef | grep netbios

    Use the following commands to stop NetBIOS and unload NetBIOS streams:

    mcsadm stop; mcs0 unload
  • Unload various streams if applicable (that is, if the file exists):
    cd /etc
    strload -uf /etc/dlpi.conf
    strload -uf /etc/pse.conf
    strload -uf /etc/netware.conf
    strload -uf /etc/xtiso.conf
  • Some customer applications will keep a device busy. Ensure that the shared applications have been stopped properly.

    MAC Address Is Not Communicated to the Ethernet Switch

    Problem

    With switched Ethernet networks, MAC address takeover sometimes appears to not function correctly. Even though HACMP has changed the MAC address of the network interface, the switch is not informed of the new MAC address. The switch does not then route the appropriate packets to the network interface.

    Solution

    Do the following to ensure that the new MAC address is communicated to the switch:

      1. Modify the line in /usr/es/sbin/cluster/etc/clinfo.rc that currently reads:
    PING_CLIENT_LIST=" "
      2. Include on this line the names or IP addresses of at least one client on each subnet on the switched Ethernet.
      3. Run clinfoES on all nodes in the HACMP cluster that are attached to the switched Ethernet.
    If you normally start HACMP cluster services using the /usr/es/sbin/cluster/etc/rc.cluster shell script, specify the -i option. If you normally start HACMP cluster services through SMIT, specify yes in the Start Cluster Information Daemon? field.
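
    For example, the modified line might look like the following, where the client names and address are hypothetical placeholders for at least one client on each switched-Ethernet subnet:

    PING_CLIENT_LIST="clienta clientb 192.168.20.15"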

    Client Issues

    The following potential HACMP client issues are described here:

  • Network Interface Swap Causes Client Connectivity Problem
  • Clients Cannot Access Applications
  • Clients Cannot Find Clusters
  • Clinfo Does Not Appear to Be Running
  • Clinfo Does Not Report That a Node Is Down.

    Network Interface Swap Causes Client Connectivity Problem

    Problem

    The client cannot connect to the cluster. The ARP cache on the client node still contains the address of the failed node, not the fallover node.

    Solution

    Issue a ping command to the client from a cluster node to update the client’s ARP cache. Be sure to include the client name as the argument to this command. The ping command will update a client’s ARP cache even if the client is not running clinfoES. You may need to add a call to the ping command in your application’s pre- or post-event processing scripts to automate this update on specific clients. Also consider using hardware address swapping, since it will maintain configured hardware-to-IP address mapping within your cluster.
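
    For example, from a cluster node (the client name is a hypothetical placeholder):

    ping -c 3 clienta     # the exchange refreshes clienta's ARP entry for the service address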

    Clients Cannot Access Applications

    Problem

    The SNMP process failed.

    Solution

    Check the /etc/hosts file on the node on which SNMP failed to ensure that it contains IP labels or addresses of cluster nodes. Also see Clients Cannot Find Clusters.

    Clients Cannot Find Clusters

    Problem

    The clstat utility running on a client cannot find any clusters. The clinfoES daemon has not properly managed the data structures it created for its clients (like clstat) because it has not located an SNMP process with which it can communicate. Because clinfoES obtains its cluster status information from SNMP, it cannot populate the HACMP MIB if it cannot communicate with this daemon. As a result, a variety of intermittent problems can occur between SNMP and clinfoES.

    Solution

    Create an updated client-based clhosts file by running verification with automatic corrective actions enabled. This produces a clhosts.client file on the server nodes. Copy this file to the /usr/es/sbin/cluster/etc/ directory on the clients, renaming the file clhosts. The clinfoES daemon uses the addresses in this file to attempt communication with an SNMP process executing on an HACMP server.
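
    A minimal sketch of the copy step, run on a client, assuming the clhosts.client file was generated in /usr/es/sbin/cluster/etc on a server node named nodea (both the path on the server and the node name are assumptions):

    rcp nodea:/usr/es/sbin/cluster/etc/clhosts.client /usr/es/sbin/cluster/etc/clhosts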

    Warning: For non-alias IP networks, do not include standby addresses in the clhosts file.

    Also, check the /etc/hosts file on the node on which the SNMP process is running and on the node having problems with clstat or other clinfo API programs.

    Clinfo Does Not Appear to Be Running

    Problem

    The service and boot addresses of the cluster node from which clinfoES was started do not exist in the client-based clhosts file.

    Solution

    Create an updated client-based clhosts file by running verification with automatic corrective actions enabled. This produces a clhosts.client file on the server nodes. Copy this file to the /usr/es/sbin/cluster/etc/ directory on the clients, renaming the file clhosts. Then run the clstat command.

    Clinfo Does Not Report That a Node Is Down

    Problem

    Even though the node is down, the SNMP daemon and clinfoES report that the node is up. All the node’s interfaces are listed as down.

    Solution

    When one or more nodes are active and another node tries to join the cluster, the current cluster nodes send information to the SNMP daemon that the joining node is up. If, for some reason, the node fails to join the cluster, clinfoES does not send another message to the SNMP daemon to report that the node is down.

    To correct the cluster status information, restart the SNMP daemon, using the options on the HACMP Cluster Services SMIT panel.

    Miscellaneous Issues

    The following non-categorized HACMP issues are described here:

  • Limited Output when Running the tail -f Command on /tmp/hacmp.out
  • CDE Hangs after IPAT on HACMP Startup
  • Cluster Verification Gives Unnecessary Message
  • config_too_long Message Appears
  • Console Displays SNMP Messages
  • Device LEDs Flash “888” (System Panic)
  • Unplanned System Reboots Cause Fallover Attempt to Fail
  • Deleted or Extraneous Objects Appear in NetView Map
  • F1 Does not Display Help in SMIT Panels
  • /usr/es/sbin/cluster/cl_event_summary.txt File (Event Summaries Display) Grows Too Large
  • View Event Summaries Does Not Display Resource Group Information as Expected
  • Application Monitor Problems
  • Cluster Disk Replacement Process Fails
  • Resource Group Unexpectedly Processed Serially
  • rg_move Event Processes Several Resource Groups at Once
  • Filesystem Fails to Unmount
  • Dynamic Reconfiguration Sets a Lock
  • WebSMIT Does Not “See” the Cluster.

    Note that if you are investigating resource group movement in HACMP (for instance, investigating why an rg_move event has occurred), always check the /tmp/hacmp.out file. In general, given the recent changes in the way resource groups are handled and prioritized in fallover circumstances, the hacmp.out file and its event summaries have become even more important in tracking the activity and resulting location of your resource groups. In addition, with parallel processing of resource groups, the hacmp.out file reports details that will not be seen in the cluster history log or the clstrmgr.debug file. Always check this log early on when investigating resource group movement after takeover activity.

    Limited Output when Running the tail -f Command on /tmp/hacmp.out

    Problem

    Only script start messages appear in the /tmp/hacmp.out file. The script specified in the message is not executable, or the DEBUG level is set to low.

    Solution

    Add executable permission to the script using the chmod command, and make sure the DEBUG level is set to high.
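
    For example (the script path is a hypothetical placeholder):

    chmod u+x /path/to/failing_script    # make the script named in the hacmp.out message executable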

    CDE Hangs after IPAT on HACMP Startup

    Problem

    If CDE is started before HACMP is started, it binds to the boot address. When HACMP is started, it swaps the IP address to the service address. If CDE has already been started, this change in the IP address causes it to hang.

    Solution

  • The output of hostname and uname -n must be the same. If they differ, use uname -S <hostname> to make the uname output match the output of hostname.
  • Define an alias for the hostname on the loopback address. This can be done by editing /etc/hosts to include an entry for:
    127.0.0.1 loopback localhost hostname

    where hostname is the name of your host. If name serving is being used on the system, edit the /etc/netsvc.conf file so that the local file is checked first when resolving names.

  • Ensure that the hostname and the service IP label resolve to different addresses. This can be determined by viewing the output of the /bin/host command for both the hostname and the service IP label, as shown in the example below.
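
    For example (the service IP label app_svc is a hypothetical placeholder):

    /bin/host $(hostname)     # address associated with the hostname
    /bin/host app_svc         # the service IP label should resolve to a different address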

    Cluster Verification Gives Unnecessary Message

    Problem

    You get the following message regardless of whether or not you have configured Auto Error Notification:

    “Remember to redo automatic error notification if configuration 
    has changed.” 
    

    Solution

    Ignore this message if you have not configured Auto Error Notification.

    config_too_long Message Appears

    This message appears each time a cluster event takes more time to complete than a specified time-out period.

    In versions prior to 4.5, the time-out period was fixed for all cluster events and set to 360 seconds by default. If a cluster event, such as a node_up or a node_down event, lasted longer than 360 seconds, then every 30 seconds HACMP issued a config_too_long warning message that was logged in the hacmp.out file.

    In HACMP 4.5 and up, you can customize the time period allowed for a cluster event to complete before HACMP issues a system warning for it.

    If this message appears, you see the following in the hacmp.out Event Start:

    config_too_long $sec $event_name $argument 
    
  • $event_name is the reconfig event that failed
  • $argument is the parameter(s) used by the event
  • $sec is the number of seconds before the message was sent out.
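
    For example, such a line might look like the following (the timing, event name, and argument are purely illustrative):

    config_too_long 360 node_up_complete nodea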

    In versions prior to HACMP 4.5, config_too_long messages continued to be appended to the hacmp.out file every 30 seconds until action was taken.

    Starting with version 4.5, for each cluster event that does not complete within the specified event duration time, config_too_long messages are logged in the hacmp.out file and sent to the console according to the following pattern:

  • The first five config_too_long messages appear in the hacmp.out file at 30-second intervals.
  • The next set of five messages appears at an interval that is double the previous interval, until the interval reaches one hour.
  • These messages are logged every hour until the event completes or is terminated on that node.

    This message could appear in response to the following problems:

    Problem

    Activities that the script is performing take longer than the specified time to complete; for example, this could happen with events involving many disks or complex scripts.

    Solution

  • Determine what is taking so long to execute, and correct or streamline that process if possible.
  • Increase the time to wait before calling config_too_long.
  • You can customize Event Duration Time using the Change/Show Time Until Warning panel in SMIT. Access this panel through the Extended Configuration > Extended Event Configuration SMIT panel.

    For complete information on tuning event duration time, see the Tuning Event Duration Time Until Warning section in the chapter on Configuring Cluster Events in the Administration Guide.

    Problem

    A command is hung and the event script is waiting for it to complete before resuming execution. If so, you can probably see the command in the AIX 5L process table (ps -ef). It is most likely the last command in the /tmp/hacmp.out file before the config_too_long script output.

    Solution

    You may need to kill the hung command. See also Dynamic Reconfiguration Sets a Lock.

    Console Displays SNMP Messages

    Problem

    The /etc/syslog.conf file has been changed to send the daemon.notice output to /dev/console.

    Solution

    Edit the /etc/syslog.conf file to redirect the daemon.notice output to /usr/tmp/snmpd.log. The snmpd.log file is the default location for logging these messages.
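
    For example, the relevant /etc/syslog.conf entry would be changed to something like the following sketch (on AIX the destination file generally must already exist, and syslogd must be refreshed after the edit):

    daemon.notice   /usr/tmp/snmpd.log
    # refresh -s syslogd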

    Device LEDs Flash “888” (System Panic)

    Problem

    Running the crash command against the system dump device and issuing the stat subcommand indicates that the panic was caused by the deadman switch. The hats daemon cannot obtain sufficient CPU time during intensive operations (df or find, for example) and may be required to wait too long for a chance at the kernel lock. Often, more than five seconds elapse before hatsd can get a lock. The result is the invocation of the deadman switch and a system panic.

    Solution

    Determine what process is hogging CPU cycles on the system that panicked. Then attempt (in order) each of the following solutions that address this problem:

      1. Tune the system using I/O pacing.
      2. Increase the syncd frequency.
      3. Change the Failure Detection Rate.

    For instructions on these procedures, see the sections under Deadman Switch Causes a Node Failure earlier in this chapter.

    Unplanned System Reboots Cause Fallover Attempt to Fail

    Problem

    Cluster nodes did not fallover after rebooting the system.

    Solution

    To prevent unplanned system reboots from disrupting a fallover in your cluster environment, all nodes in the cluster should either have the Automatically REBOOT a system after a crash field on the Change/Show Characteristics of Operating System SMIT panel set to false, or you should keep the IBM eServer pSeries key in Secure mode during normal operation.

    Both measures prevent a system from rebooting if the shutdown command is issued inadvertently. Without one of these measures in place, if an unplanned reboot occurs the activity against the disks on the rebooting node can prevent other nodes from successfully acquiring the disks.

    Deleted or Extraneous Objects Appear in NetView Map

    Problem

    Previously deleted or extraneous object symbols appeared in the NetView map.

    Solution

    Rebuild the NetView database.

    To rebuild the NetView database, perform the following steps on the NetView server:

      1. Stop all NetView daemons: /usr/OV/bin/ovstop -a
      2. Remove the database from the NetView server: rm -rf /usr/OV/database/*
      3. Start the NetView object database: /usr/OV/bin/ovstart ovwdb
      4. Restore the NetView/HAView fields: /usr/OV/bin/ovw -fields
      5. Start all NetView daemons: /usr/OV/bin/ovstart -a

    F1 Does not Display Help in SMIT Panels

    Problem

    Pressing F1 in SMIT panel does not display help.

    Solution

    Help can be displayed only if the LANG variable is set to one of the languages supported by HACMP, and if the associated HACMP message catalogs are installed. The languages supported by HACMP 5.4 are:

    en_US
    ja_JP
    En_US
    Ja_JP

    To list the installed locales (the bsl LPPs), type:

    locale -a 
    

    To list the active locale, type:

    locale 
    

    Since the LANG environment variable determines the active locale, if LANG=en_US, the locale is en_US.

    /usr/es/sbin/cluster/cl_event_summary.txt File (Event Summaries Display) Grows Too Large

    Problem

    In HACMP, event summaries are pulled from the hacmp.out file and stored in the cl_event_summary.txt file. This file continues to accumulate as hacmp.out cycles, and is not automatically truncated or replaced. Consequently, it can grow too large and crowd your /usr directory.

    Solution

    Clear event summaries periodically, using the Problem Determination Tools > HACMP Log Viewing and Management > View/Save/Remove HACMP Event Summaries > Remove Event Summary History option in SMIT.

    View Event Summaries Does Not Display Resource Group Information as Expected

    Problem

    In HACMP, event summaries are pulled from the hacmp.out file and can be viewed using the Problem Determination Tools > HACMP Log Viewing and Management > View/Save/Delete Event Summaries > View Event Summaries option in SMIT. This display includes resource group status and location information at the end. The resource group information is gathered by clRGinfo, which may take extra time if the cluster is not running when you run the View Event Summaries option.

    Solution

    clRGinfo displays resource group information more quickly when the cluster is running.

    If the cluster is not running, wait a few minutes and the resource group information will eventually appear.

    Application Monitor Problems

    If you are running application monitors you may encounter occasional problems or situations in which you want to check the state or the configuration of a monitor. Here are some possible problems and ways to diagnose and act on them.

    Problem 1

    Checking the State of an Application Monitor. In some circumstances, it may not be clear whether an application monitor is currently running or not. To check on the state of an application monitor, run the following command:

    ps -ef | grep <application server name> | grep clappmond 
    

    This command produces a long line of verbose output if the application is being monitored.

    If there is no output, the application is not being monitored.

    Solution 1

    If the application monitor is not running, there may be a number of reasons, including

  • No monitor has been configured for the application server
  • The monitor has not started yet because the stabilization interval has not completed
  • The monitor is in a suspended state
  • The monitor was not configured properly
  • An error has occurred.

    Check to see that a monitor has been configured, the stabilization interval has passed, and the monitor has not been placed in a suspended state, before concluding that something is wrong.

    If something is clearly wrong, reexamine the original configuration of the monitor in SMIT and reconfigure as needed.

    Problem 2

    Application Monitor Does Not Perform Specified Failure Action. The specified failure action does not occur even when an application has clearly failed.

    Solution 2

    Check the Restart Interval. If set too short, the Restart Counter may be reset to zero too quickly, resulting in an endless series of restart attempts and no other action taken.

    Cluster Disk Replacement Process Fails

    Problem

    The disk replacement process failed while the replacepv command was running.

    Solution

    Be sure to delete the /tmp/replacepv directory, and attempt the replacement process again.

    You can also try running the process on another disk.

    Resource Group Unexpectedly Processed Serially

    Problem

    A resource group is unexpectedly processed serially even though you did not request it to be this way.

    Solution

    Check for the site policy that is specified for this resource group, and make sure it is set to Ignore. Then delete this resource group from the customized serial processing order list in SMIT and synchronize the cluster.

    rg_move Event Processes Several Resource Groups at Once

    Problem

    In hacmp.out, you see that an rg_move event processes multiple non-concurrent resource groups in one operation.

    Solution

    This is the expected behavior. In clusters with dependencies, HACMP processes all resource groups upon node_up events, via rg_move events. During a single rg_move event, HACMP can process multiple non-concurrent resource groups within one event. For an example of the output, see the Processing in Clusters with Dependent Resource Groups or Sites section.

    Filesystem Fails to Unmount

    Problem

    A filesystem is not unmounted properly during an event such as when you stop cluster services with the option to bring resource groups offline.

    Solution

    One of the more common reasons a filesystem fails to unmount when you stop cluster services with the option to bring resource groups offline is that the filesystem is busy. To unmount a filesystem successfully, no processes or users can be accessing it at the time. If a user or process is holding it open, the filesystem is “busy” and will not unmount.

    The same issue may result if a file has been deleted but is still open.

    The script that stops an application should also check that the shared filesystems are not in use, including files that have been deleted but are still open. Use the fuser command to see which processes or users are accessing the filesystems in question; the PIDs of those processes can then be acquired and killed, as in the sketch below. This frees the filesystem so it can be unmounted.
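
    A minimal sketch of such a check in an application stop script, assuming a hypothetical shared filesystem mounted at /sharedfs:

    FS=/sharedfs
    # list the users and PIDs of processes with files open in the filesystem
    fuser -cux $FS
    # send SIGKILL to those processes so the filesystem can be unmounted
    fuser -kcux $FS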

    Refer to the AIX 5L man pages for complete information on this command.

    Dynamic Reconfiguration Sets a Lock

    Problem

    When attempting a DARE operation, an error message may be generated regarding a DARE lock if another DARE operation is in process, or if a previous DARE operation did not complete properly.

    The error message suggests that one should take action to clear the lock if a DARE operation is not in process. “In process” here refers to another DARE operation that may have just been issued, but it also refers to any previous DARE operation that did not complete properly.

    Solution

    The first step is to examine the /tmp/hacmp.out logs on the cluster nodes to determine the reason for the previous DARE failure. A config_too_long entry will likely appear in hacmp.out where an operation in an event script took too long to complete. If hacmp.out indicates that a script failed to complete due to some error, correct this problem and manually complete the remaining steps that are necessary to complete the event.

    Run the HACMP SMIT Problem Determination Tools > Recover from HACMP Script Failure option. This should bring the nodes in the cluster to the next complete event state.

    You can clear the DARE lock by selecting the HACMP SMIT option Problem Determination Tools > Release Locks Set by Dynamic Configuration if the HACMP SMIT Recover from HACMP Script Failure step did not do so.

    WebSMIT Does Not “See” the Cluster

    WebSMIT is designed to run on a single node. If that node goes down, WebSMIT will become unavailable. To increase availability, you can set up WebSMIT to run on multiple nodes. Since WebSMIT is retrieving and updating information from the HACMP cluster, that information should be available from all nodes in the cluster.

    Typically, you will set up WebSMIT to be accessible from a cluster's internal network but not reachable from the Internet. If sites are configured, and WebSMIT is running on a node on a remote site, you must ensure HTTP connectivity to that node; it is not handled automatically by WebSMIT or HACMP. HTTPS/SSL is highly recommended for security.

    Because WebSMIT runs on one node in the cluster, the functionality it provides and the information it displays directly correspond to the version of HACMP installed on that node. For HACMP 5.4 WebSMIT to work properly, you must have cluster services running on at least one node, and JavaScript must be enabled in the client browser.


    PreviousNextIndex