
Contents


About This Guide

Chapter 1: Troubleshooting HACMP Clusters

Troubleshooting an HACMP Cluster Overview   

Becoming Aware of the Problem   

Determining a Problem Source   

Stopping the Cluster Manager   

Using the AIX Data Collection Utility   

Checking a Cluster Configuration with Online Planning Worksheets   

Using HACMP Diagnostic Utilities   

Verifying Expected Behavior   

Using the Problem Determination Tools   

HACMP Verification   

Viewing Current State   

HACMP Log Viewing and Management   

Recovering from HACMP Script Failure   

Restoring HACMP Configuration Database from an Active Configuration   

Release Locks Set by Dynamic Reconfiguration   

Clear SSA Disk Fence Registers   

HACMP Cluster Test Tool   

HACMP Trace Facility   

HACMP Event Emulation   

HACMP Error Notification   

Opening a SMIT Session on a Node   

Configuring Cluster Performance Tuning    

Setting I/O Pacing   

Setting Syncd Frequency   

Resetting HACMP Tunable Values   

Prerequisites and Limitations   

Listing Tunable Values   

Resetting HACMP Tunable Values Using SMIT   

Resetting HACMP Tunable Values Using the Command Line   

Sample Custom Scripts   

Making cron jobs Highly Available   

Making Print Queues Highly Available   

Where You Go from Here   

Chapter 2: Using Cluster Log Files

Viewing HACMP Cluster Log Files   

Reviewing Cluster Message Log Files   

Understanding the cluster.log File    

Understanding the hacmp.out Log File   

Viewing Compiled hacmp.out Event Summaries   

Understanding the System Error Log    

Understanding the Cluster History Log File   

Understanding the Cluster Manager Debug Log File   

Understanding the cspoc.log File   

Collecting Cluster Log Files for Problem Reporting   

Tracking Resource Group Parallel and Serial Processing in the hacmp.out File   

Serial Processing Order Reflected in Event Summaries   

Parallel Processing Order Reflected in Event Summaries   

Job Types: Parallel Resource Group Processing   

Disk Fencing with Serial or Parallel Processing   

Processing in Clusters with Dependent Resource Groups or Sites   

Managing a Node’s HACMP Log File Parameters   

Logging for clcomd   

Redirecting HACMP Cluster Log Files   

Steps for Redirecting a Cluster Log File   

Chapter 3: Investigating System Components and Solving Common Problems

Overview   

Investigating System Components   

Checking Highly Available Applications   

Checking the HACMP Layer   

Checking HACMP Components   

Checking for Cluster Configuration Problems   

Checking a Cluster Snapshot File   

Checking the Logical Volume Manager   

Checking Volume Group Definitions   

Checking the Varyon State of a Volume Group   

Checking Physical Volumes   

Checking Filesystems   

Checking Mount Points, Permissions, and Filesystem Information   

Checking the TCP/IP Subsystem   

Checking Point-to-Point Connectivity   

Checking the IP Address and Netmask   

Checking Heartbeating over IP Aliases   

Checking ATM Classic IP Hardware Addresses    

Checking the AIX 5L Operating System   

Checking Physical Networks   

Checking Disks, Disk Adapters, and Disk Heartbeating Networks   

Recovering from PCI Hot Plug NIC Failure   

Checking Disk Heartbeating Networks   

Checking the Cluster Communications Daemon   

Checking System Hardware   

HACMP Installation Issues   

Cannot Find Filesystem at Boot Time   

cl_convert Does Not Run Due to Failed Installation   

Configuration Files Could Not Be Merged during Installation   

HACMP Startup Issues   

ODMPATH Environment Variable Not Set Correctly   

clinfo Daemon Exits after Starting   

Node Powers Down; Cluster Manager Will Not Start   

configchk Command Returns an Unknown Host Message   

Cluster Manager Hangs during Reconfiguration   

clcomdES and clstrmgrES Fail to Start on Newly Installed AIX 5L Nodes   

Pre- or Post-Event Does Not Exist on a Node after Upgrade   

Node Fails During Configuration with “869” LED Display   

Node Cannot Rejoin Cluster after Being Dynamically Removed   

Resource Group Migration Is Not Persistent after Cluster Startup   

SP Cluster Does Not Start Up after Upgrade to HACMP 5.4   

Disk and Filesystem Issues   

AIX 5L Volume Group Commands Cause System Error Reports   

Verification Fails on Clusters with Disk Heartbeating Networks   

varyonvg Command Fails on a Volume Group   

cl_nfskill Command Fails   

cl_scdiskreset Command Fails   

fsck Command Fails at Boot Time   

System Cannot Mount Specified Filesystems   

Cluster Disk Replacement Process Fails   

Automatic Error Notification Fails with Subsystem Device Driver   

Filesystem Change Not Recognized by Lazy Update   

Network and Switch Issues   

Unexpected Network Interface Failure in Switched Networks   

Cluster Nodes Cannot Communicate   

Distributed SMIT Causes Unpredictable Results   

Token-Ring Network Thrashes   

System Crashes Reconnecting MAU Cables after a Network Failure   

TMSCSI Will Not Properly Reintegrate when Reconnecting Bus   

Recovering from PCI Hot Plug NIC Failure   

Unusual Cluster Events Occur in Non-Switched Environments   

Cannot Communicate on ATM Classic IP Network   

Cannot Communicate on ATM LAN Emulation Network   

IP Label for HACMP Disconnected from AIX 5L Interface   

TTY Baud Rate Setting Wrong   

First Node Up Gives Network Error Message in hacmp.out   

Network Interface Card and Network ODMs Out of Sync with Each Other   

Non-IP Network, Network Adapter or Node Failures   

Networking Problems Following HACMP Fallover   

Packets Lost during Data Transmission   

Verification Fails when Geo Networks Uninstalled   

Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks   

Cluster Communications Issues   

Message Encryption Fails   

Cluster Nodes Do Not Communicate with Each Other   

HACMP Takeover Issues   

varyonvg Command Fails during Takeover    

Highly Available Applications Fail   

Node Failure Detection Takes Too Long   

HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX 5L   

Group Services Sends GS_DOM_MERGE_ER Message   

cfgmgr Command Causes Unwanted Behavior in Cluster   

Releasing Large Amounts of TCP Traffic Causes DMS Timeout   

Deadman Switch Causes a Node Failure   

Deadman Switch Time to Trigger   

A “device busy” Message Appears after node_up_local Fails   

Network Interfaces Swap Fails Due to an rmdev “device busy” Error   

MAC Address Is Not Communicated to the Ethernet Switch   

Client Issues   

Network Interface Swap Causes Client Connectivity Problem   

Clients Cannot Access Applications   

Clients Cannot Find Clusters   

Clinfo Does Not Appear to Be Running   

Clinfo Does Not Report That a Node Is Down   

Miscellaneous Issues   

Limited Output when Running the tail -f Command on /tmp/hacmp.out    

CDE Hangs after IPAT on HACMP Startup   

Cluster Verification Gives Unnecessary Message   

config_too_long Message Appears   

Console Displays SNMP Messages   

Device LEDs Flash “888” (System Panic)   

Unplanned System Reboots Cause Fallover Attempt to Fail   

Deleted or Extraneous Objects Appear in NetView Map   

F1 Does Not Display Help in SMIT Panels   

/usr/es/sbin/cluster/cl_event_summary.txt File (Event Summaries Display) Grows Too Large   

View Event Summaries Does Not Display Resource Group Information as Expected   

Application Monitor Problems   

Cluster Disk Replacement Process Fails   

Resource Group Unexpectedly Processed Serially   

rg_move Event Processes Several Resource Groups at Once   

Filesystem Fails to Unmount   

Dynamic Reconfiguration Sets a Lock   

WebSMIT Does Not “See” the Cluster   

Appendix A: Script Utilities

Appendix B: Command Execution Language Guide

Appendix C: HACMP Tracing

