Table of Contents

Troubleshooting Guide

About This Guide

Chapter 1: Troubleshooting HACMP Clusters

Troubleshooting an HACMP Cluster Overview

Becoming Aware of the Problem

Determining a Problem Source

Stopping the Cluster Manager

Using the AIX Data Collection Utility

Checking a Cluster Configuration with Online Planning Worksheets

Using HACMP Diagnostic Utilities

Verifying Expected Behavior

Using the Problem Determination Tools

HACMP Verification

Viewing Current State

HACMP Log Viewing and Management

Recovering from HACMP Script Failure

Restoring HACMP Configuration Database from an Active Configuration

Release Locks Set by Dynamic Reconfiguration

Clear SSA Disk Fence Registers

HACMP Cluster Test Tool

HACMP Trace Facility

HACMP Event Emulation

HACMP Error Notification

Opening a SMIT Session on a Node

Configuring Cluster Performance Tuning

Setting I/O Pacing

Setting Syncd Frequency

Resetting HACMP Tunable Values

Prerequisites and Limitations

Listing Tunable Values

Resetting HACMP Tunable Values Using SMIT

Resetting HACMP Tunable Values Using the Command Line

Sample Custom Scripts

Making cron Jobs Highly Available

Making Print Queues Highly Available

Where You Go from Here

Chapter 2: Using Cluster Log Files

Viewing HACMP Cluster Log Files

Reviewing Cluster Message Log Files

Understanding the cluster.log File

Understanding the hacmp.out Log File

Viewing Compiled hacmp.out Event Summaries

Understanding the System Error Log

Understanding the Cluster History Log File

Understanding the Cluster Manager Debug Log File

Understanding the cspoc.log File

Collecting Cluster Log Files for Problem Reporting

Tracking Resource Group Parallel and Serial Processing in the hacmp.out File

Serial Processing Order Reflected in Event Summaries

Parallel Processing Order Reflected in Event Summaries

Job Types: Parallel Resource Group Processing

Disk Fencing with Serial or Parallel Processing

Processing in Clusters with Dependent Resource Groups or Sites

Managing a Node’s HACMP Log File Parameters

Logging for clcomd

Redirecting HACMP Cluster Log Files

Steps for Redirecting a Cluster Log File

Chapter 3: Investigating System Components and Solving Common Problems

Overview

Investigating System Components

Checking Highly Available Applications

Checking the HACMP Layer

Checking HACMP Components

Checking for Cluster Configuration Problems

Checking a Cluster Snapshot File

Checking the Logical Volume Manager

Checking Volume Group Definitions

Checking the Varyon State of a Volume Group

Checking Physical Volumes

Checking Filesystems

Checking Mount Points, Permissions, and Filesystem Information

Checking the TCP/IP Subsystem

Checking Point-to-Point Connectivity

Checking the IP Address and Netmask

Checking Heartbeating over IP Aliases

Checking ATM Classic IP Hardware Addresses

Checking the AIX 5L Operating System

Checking Physical Networks

Checking Disks, Disk Adapters, and Disk Heartbeating Networks

Recovering from PCI Hot Plug NIC Failure

Checking Disk Heartbeating Networks

Checking the Cluster Communications Daemon

Checking System Hardware

HACMP Installation Issues

Cannot Find Filesystem at Boot Time

cl_convert Does Not Run Due to Failed Installation

Configuration Files Could Not Be Merged during Installation

HACMP Startup Issues

ODMPATH Environment Variable Not Set Correctly

clinfo Daemon Exits after Starting

Node Powers Down; Cluster Manager Will Not Start

configchk Command Returns an Unknown Host Message

Cluster Manager Hangs during Reconfiguration

clcomdES and clstrmgrES Fail to Start on Newly Installed AIX 5L Nodes

Pre- or Post-Event Does Not Exist on a Node after Upgrade

Node Fails During Configuration with “869” LED Display

Node Cannot Rejoin Cluster after Being Dynamically Removed

Resource Group Migration Is Not Persistent after Cluster Startup

SP Cluster Does Not Start Up after Upgrade to HACMP 5.4

Disk and Filesystem Issues

AIX 5L Volume Group Commands Cause System Error Reports

Verification Fails on Clusters with Disk Heartbeating Networks

varyonvg Command Fails on a Volume Group

cl_nfskill Command Fails

cl_scdiskreset Command Fails

fsck Command Fails at Boot Time

System Cannot Mount Specified Filesystems

Cluster Disk Replacement Process Fails

Automatic Error Notification Fails with Subsystem Device Driver

Filesystem Change Not Recognized by Lazy Update

Network and Switch Issues

Unexpected Network Interface Failure in Switched Networks

Cluster Nodes Cannot Communicate

Distributed SMIT Causes Unpredictable Results

Token-Ring Network Thrashes

System Crashes Reconnecting MAU Cables after a Network Failure

TMSCSI Will Not Properly Reintegrate when Reconnecting Bus

Recovering from PCI Hot Plug NIC Failure

Unusual Cluster Events Occur in Non-Switched Environments

Cannot Communicate on ATM Classic IP Network

Cannot Communicate on ATM LAN Emulation Network

IP Label for HACMP Disconnected from AIX 5L Interface

TTY Baud Rate Setting Wrong

First Node Up Gives Network Error Message in hacmp.out

Network Interface Card and Network ODMs Out of Sync with Each Other

Non-IP Network, Network Adapter or Node Failures

Networking Problems Following HACMP Fallover

Packets Lost during Data Transmission

Verification Fails when Geo Networks Uninstalled

Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks

Cluster Communications Issues

Message Encryption Fails

Cluster Nodes Do Not Communicate with Each Other

HACMP Takeover Issues

varyonvg Command Fails during Takeover

Highly Available Applications Fail

Node Failure Detection Takes Too Long

HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX 5L

Group Services Sends GS_DOM_MERGE_ER Message

cfgmgr Command Causes Unwanted Behavior in Cluster

Releasing Large Amounts of TCP Traffic Causes DMS Timeout

Deadman Switch Causes a Node Failure

Deadman Switch Time to Trigger

A “device busy” Message Appears after node_up_local Fails

Network Interfaces Swap Fails Due to an rmdev “device busy” Error

MAC Address Is Not Communicated to the Ethernet Switch

Client Issues

Network Interface Swap Causes Client Connectivity Problem

Clients Cannot Access Applications

Clients Cannot Find Clusters

Clinfo Does Not Appear to Be Running

Clinfo Does Not Report That a Node Is Down

Miscellaneous Issues

Limited Output when Running the tail -f Command on /tmp/hacmp.out

CDE Hangs after IPAT on HACMP Startup

Cluster Verification Gives Unnecessary Message

config_too_long Message Appears

Console Displays SNMP Messages

Device LEDs Flash “888” (System Panic)

Unplanned System Reboots Cause Fallover Attempt to Fail

Deleted or Extraneous Objects Appear in NetView Map

F1 Does Not Display Help in SMIT Panels

/usr/es/sbin/cluster/cl_event_summary.txt File (Event Summaries Display) Grows Too Large

View Event Summaries Does Not Display Resource Group Information as Expected

Application Monitor Problems

Cluster Disk Replacement Process Fails

Resource Group Unexpectedly Processed Serially

rg_move Event Processes Several Resource Groups at Once

Filesystem Fails to Unmount

Dynamic Reconfiguration Sets a Lock

WebSMIT Does Not “See” the Cluster

Appendix A: Script Utilities

Appendix B: Command Execution Language Guide

Appendix C: HACMP Tracing

Notices for HACMP Troubleshooting Guide

Index