Chapter 1: Troubleshooting HACMP Clusters
Troubleshooting an HACMP Cluster Overview
Using the AIX Data Collection Utility
Checking a Cluster Configuration with Online Planning Worksheets
Using HACMP Diagnostic Utilities
Using the Problem Determination Tools
HACMP Log Viewing and Management
Recovering from HACMP Script Failure
Restoring HACMP Configuration Database from an Active Configuration
Release Locks Set by Dynamic Reconfiguration
Clear SSA Disk Fence Registers
Opening a SMIT Session on a Node
Configuring Cluster Performance Tuning
Resetting HACMP Tunable Values
Resetting HACMP Tunable Values Using SMIT
Resetting HACMP Tunable Values Using the Command Line
Making cron Jobs Highly Available
Making Print Queues Highly Available
Chapter 2: Using Cluster Log Files
Viewing HACMP Cluster Log Files
Reviewing Cluster Message Log Files
Understanding the cluster.log File
Understanding the hacmp.out Log File
Viewing Compiled hacmp.out Event Summaries
Understanding the System Error Log
Understanding the Cluster History Log File
Understanding the Cluster Manager Debug Log File
Understanding the cspoc.log File
Collecting Cluster Log Files for Problem Reporting
Tracking Resource Group Parallel and Serial Processing in the hacmp.out File
Serial Processing Order Reflected in Event Summaries
Parallel Processing Order Reflected in Event Summaries
Job Types: Parallel Resource Group Processing
Disk Fencing with Serial or Parallel Processing
Processing in Clusters with Dependent Resource Groups or Sites
Managing a Node’s HACMP Log File Parameters
Redirecting HACMP Cluster Log Files
Steps for Redirecting a Cluster Log File
Chapter 3: Investigating System Components and Solving Common Problems
Investigating System Components
Checking Highly Available Applications
Checking for Cluster Configuration Problems
Checking a Cluster Snapshot File
Checking the Logical Volume Manager
Checking Volume Group Definitions
Checking the Varyon State of a Volume Group
Checking Mount Points, Permissions, and Filesystem Information
Checking Point-to-Point Connectivity
Checking the IP Address and Netmask
Checking Heartbeating over IP Aliases
Checking ATM Classic IP Hardware Addresses
Checking the AIX 5L Operating System
Checking Disks, Disk Adapters, and Disk Heartbeating Networks
Recovering from PCI Hot Plug NIC Failure
Checking Disk Heartbeating Networks
Checking the Cluster Communications Daemon
Cannot Find Filesystem at Boot Time
cl_convert Does Not Run Due to Failed Installation
Configuration Files Could Not Be Merged during Installation
ODMPATH Environment Variable Not Set Correctly
clinfo Daemon Exits after Starting
Node Powers Down; Cluster Manager Will Not Start
configchk Command Returns an Unknown Host Message
Cluster Manager Hangs during Reconfiguration
clcomdES and clstrmgrES Fail to Start on Newly Installed AIX 5L Nodes
Pre- or Post-Event Does Not Exist on a Node after Upgrade
Node Fails During Configuration with “869” LED Display
Node Cannot Rejoin Cluster after Being Dynamically Removed
Resource Group Migration Is Not Persistent after Cluster Startup
SP Cluster Does Not Start Up after Upgrade to HACMP 5.4
AIX 5L Volume Group Commands Cause System Error Reports
Verification Fails on Clusters with Disk Heartbeating Networks
varyonvg Command Fails on a Volume Group
fsck Command Fails at Boot Time
System Cannot Mount Specified Filesystems
Cluster Disk Replacement Process Fails
Automatic Error Notification Fails with Subsystem Device Driver
Filesystem Change Not Recognized by Lazy Update
Unexpected Network Interface Failure in Switched Networks
Cluster Nodes Cannot Communicate
Distributed SMIT Causes Unpredictable Results
System Crashes Reconnecting MAU Cables after a Network Failure
TMSCSI Will Not Properly Reintegrate when Reconnecting Bus
Recovering from PCI Hot Plug NIC Failure
Unusual Cluster Events Occur in Non-Switched Environments
Cannot Communicate on ATM Classic IP Network
Cannot Communicate on ATM LAN Emulation Network
IP Label for HACMP Disconnected from AIX 5L Interface
First Node Up Gives Network Error Message in hacmp.out
Network Interface Card and Network ODMs Out of Sync with Each Other
Non-IP Network, Network Adapter or Node Failures
Networking Problems Following HACMP Fallover
Packets Lost during Data Transmission
Verification Fails when Geo Networks Uninstalled
Missing Entries in /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks
Cluster Nodes Do Not Communicate with Each Other
varyonvg Command Fails during Takeover
Highly Available Applications Fail
Node Failure Detection Takes Too Long
HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX 5L
Group Services Sends GS_DOM_MERGE_ER Message
cfgmgr Command Causes Unwanted Behavior in Cluster
Releasing Large Amounts of TCP Traffic Causes DMS Timeout
Deadman Switch Causes a Node Failure
Deadman Switch Time to Trigger
A “device busy” Message Appears after node_up_local Fails
Network Interface Swap Fails Due to an rmdev “device busy” Error
MAC Address Is Not Communicated to the Ethernet Switch
Network Interface Swap Causes Client Connectivity Problem
Clients Cannot Access Applications
Clinfo Does Not Appear to Be Running
Clinfo Does Not Report That a Node Is Down
Limited Output when Running the tail -f Command on /tmp/hacmp.out
CDE Hangs after IPAT on HACMP Startup
Cluster Verification Gives Unnecessary Message
config_too_long Message Appears
Console Displays SNMP Messages
Device LEDs Flash “888” (System Panic)
Unplanned System Reboots Cause Fallover Attempt to Fail
Deleted or Extraneous Objects Appear in NetView Map
F1 Does Not Display Help in SMIT Panels
/usr/es/sbin/cluster/cl_event_summary.txt File (Event Summaries Display) Grows Too Large
View Event Summaries Does Not Display Resource Group Information as Expected
Cluster Disk Replacement Process Fails
Resource Group Unexpectedly Processed Serially
rg_move Event Processes Several Resource Groups at Once
Dynamic Reconfiguration Sets a Lock
WebSMIT Does Not “See” the Cluster
Appendix B: Command Execution Language Guide