
Chapter 2: Initial Cluster Planning


This chapter describes the initial steps you take to plan an HACMP cluster to make applications highly available, including completing the initial planning worksheets. This chapter contains the following sections:

  • Prerequisites
  • Overview
  • Planning Cluster Security
  • Planning Cluster Nodes
  • Planning Cluster Sites
  • Application Planning
  • Planning Considerations for Multi-Tiered Applications
  • Planning Applications and Application Servers
  • Planning for AIX 5L Fast Connect
  • Planning for Highly Available Communication Links
  • Drawing a Cluster Diagram
  • Where You Go from Here.

    Prerequisites

    Before you start HACMP planning, make sure that you understand the concepts and terminology relevant to HACMP. In addition, thoroughly read and understand all concepts in the Concepts and Facilities Guide and the Master Glossary.

    Overview

    An HACMP cluster provides a highly available environment for mission-critical applications. In many organizations, these applications must remain available at all times. For example, an HACMP cluster could run a database server program that services client applications, keeping it highly available for clients that send queries to the server program.

    To create effective clusters, fill out the planning worksheets provided in Appendix A: Planning Worksheets. These worksheets help you to ensure that you include the necessary components in a cluster, and provide documentation for future reference.

    During the cluster-planning process, the Online Planning Worksheets application enables you to enter configuration data and save it to a cluster definition file. At the end of the planning process, you can use the cluster definition file to immediately configure your cluster. For information about Online Planning Worksheets, see Chapter 9: Using Online Planning Worksheets.

    If you are creating a basic two-node cluster, you can do so by using the Two-Node Cluster Configuration Assistant. For information about the Two-Node Cluster Configuration Assistant, see the chapter on Creating a Basic HACMP Cluster in the Installation Guide.

    Planning Cluster Security

    HACMP provides cluster security by:

  • Controlling user access to HACMP
  • Providing security for inter-node communications.

    Managing User Account Security

    For information about configuring user accounts, see Chapter 16: Managing Users and Groups in the Administration Guide.

    Managing Cluster Security

    This section provides an overview of cluster security. For detailed information about configuring HACMP cluster security, see Chapter 17: Managing Cluster Security in the Administration Guide.

    To secure inter-node communications, you can configure:

  • Connection Authentication
  • Message Authentication and Encryption.

    Connection Authentication

    HACMP provides connection authentication to protect HACMP communications between cluster nodes. There are two types of connection authentication:

  • Standard authentication (default). Standard authentication includes verified connections by IP address and limits the commands that can be run with root privilege. This mode uses the principle of least-privilege for remote command execution, ensuring that no arbitrary command can run on a remote node with root privilege. A select set of HACMP commands is considered trusted and allowed to run as root; all other commands run as user nobody. Starting with HACMP 5.1, the ~/.rhosts dependency for inter-node communication was eliminated.
  • Kerberos authentication. Kerberos authentication can be used in addition to standard security mode on SP systems.

    You can also configure a virtual private network (VPN) for inter-node communications. If you use a VPN, use persistent labels for VPN tunnels.

    Message Authentication and Encryption

    HACMP provides security for HACMP messages sent between cluster nodes as follows:

  • Message authentication ensures the origination and integrity of a message.
  • Message encryption changes the appearance of the data as it is transmitted and returns it to its original form when received by a node that authenticates the message.

    HACMP supports the following types of encryption keys for message authentication and encryption:

  • Message Digest 5 (MD5) with Data Encryption Standard (DES)
  • MD5 with Triple DES
  • MD5 with Advanced Encryption Standard (AES).

    Select an encryption algorithm that is compatible with the security methodology used by your organization.

    Planning Cluster Nodes

    For each critical application, be mindful of the resources required by the application, including its processing and data storage requirements. For example, when you plan the size of your cluster, include enough nodes to handle the processing requirements of your application after a node fails.

    You can create HACMP clusters that include up to 32 nodes. The Cluster Site Worksheet, provided in Appendix A: Planning Worksheets, is useful for this task.

    Keep in mind the following considerations when determining the number of cluster nodes:

  • An HACMP cluster can be made up of any combination of IBM eServer pSeries workstations, LPARs, and SP nodes. Ensure that cluster nodes do not share components that could be a single point of failure (for example, a power supply). Similarly, do not place all cluster nodes in a single rack.
  • Create small clusters that consist of nodes that perform similar functions or share resources. Smaller, simple clusters are easier to design, implement, and maintain.
  • For performance reasons, it may be desirable to use multiple nodes to support the same application. To provide mutual takeover services, the application must be designed in a manner that allows multiple instances of the application to run on the same node.
  • For example, if an application requires that the dynamic data reside in a directory called /data, chances are that the application cannot support multiple instances on the same processor. For such an application (running in a non-concurrent environment), try to partition the data so that multiple instances of the application can run—each accessing a unique database.
    Furthermore, if the application supports configuration files that enable the administrator to specify that the dynamic data for instance1 of the application resides in the data1 directory, instance2 resides in the data2 directory, and so on, then multiple instances of the application are probably supported.
  • In certain configurations, including additional nodes in the cluster design can increase the level of availability provided by the cluster; it also gives you more flexibility in planning node fallover and reintegration.
  • The most reliable cluster node configuration is to have at least one standby node.
  • Choose cluster nodes that have enough I/O slots to support redundant network interface cards and disk adapters.
  • Remember that a cluster of multiple nodes costs more than a single node; without redundant hardware to support the cluster (such as enough I/O slots for network and disk adapters), however, the cluster provides no better availability.
  • Use nodes with similar processing speed.
  • Use nodes with sufficient CPU cycles and I/O bandwidth to allow the production application to run at peak load. Remember, nodes must also have enough capacity left over to allow HACMP itself to operate.
  • To plan for this, benchmark or model your production application and list the parameters of the heaviest expected loads. Then choose nodes for an HACMP cluster that will not exceed 85% busy when running your production application. (A simple sampling sketch follows this list.)
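
    For example, one rough way to capture peak-load numbers is to sample CPU utilization with sar while the application runs at its heaviest expected load. This sketch assumes the bos.acct fileset, which provides sar, is installed; the output file name is arbitrary:

    # Sample CPU utilization every 60 seconds for 30 minutes during the test
    sar -u 60 30 > /tmp/peakload.sar
    # The Average line reports %usr %sys %wio %idle; keep %usr + %sys
    # below roughly 85% to leave headroom for HACMP itself
    grep Average /tmp/peakload.sar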

    When you create a cluster, you assign a name to it. HACMP associates this name with the HACMP-assigned cluster ID.

    Planning Cluster Sites

    Cluster configurations typically use one site. If you have multiple sites, in addition to HACMP, use one of the following HACMP/XD features for disaster recovery:

  • HACMP/XD for AIX 5L for Metro Mirror
  • HACMP/XD for AIX 5L for GLVM
  • HACMP/XD for AIX 5L for HAGEO.

    You also plan for sites if you intend to use cross-site LVM mirroring.

    You configure sites as part of your cluster configuration process.

    Planning Resources and Site Policy

    HACMP tries to ensure that the primary instance of a resource group is maintained online at one site, and the secondary instance is maintained online at the other site. Plan which nodes to configure at which site, and where you want the active applications to run, so you can plan the resource group policies accordingly.

    For more information on planning resource groups in a configuration with sites, see the section Planning Resource Groups in Clusters with Sites in Chapter 6: Planning Resource Groups.

    For more information on migrating resource groups with replicated resources, see the Administration Guide.

    All resources defined to HACMP must have unique names, as enforced by SMIT. The service IP labels, volume groups and resource group names must be both unique within the cluster and distinct from each other. The name of a resource should relate to the application it serves, as well as to any corresponding device, such as websphere_service_address.

    HACMP/XD for GLVM Mirroring Overview

    High Availability Cluster Multi-Processing Extended Distance (HACMP/XD) for Geographic Logical Volume Manager (GLVM) provides disaster recovery and data mirroring capability for the data at geographically separated sites. It protects the data against total site failure by remote mirroring, and supports unlimited distance between participating sites.

    HACMP/XD for GLVM increases data availability by providing continuing service during hardware or software outages (or both), planned or unplanned, for a two-site cluster that serially accesses mirrored volume groups across an unlimited distance over an IP-based network.

    HACMP/XD for GLVM provides two major facilities:

  • Remote data mirroring. HACMP/XD for GLVM creates a remote mirror copy of the data that the application can access both at the local site and at the remote site.
    The software protects critical data by mirroring the non-concurrent volume groups to which the application sends I/O requests. The application can access the same data regardless of whether it is running at the local or the remote site.

    The data mirroring function utilizes the mirroring capabilities of the AIX 5L Logical Volume Manager along with the mirroring function of the Geographic Logical Volume Manager that is provided by the HACMP/XD for GLVM software.

  • Integration with HACMP. By integrating with HACMP, HACMP/XD for GLVM keeps mission-critical systems and applications operational in the event of disasters. It manages the fallover and fallback of the resource group that contains the application.
    If a node, a network interface, or an entire site fails, HACMP/XD for GLVM moves the resource group to another node. The node may belong either to the same site or to the remote site. With this operation, a complete, up-to-date copy of the volume group’s data remains available for the application to use on a node at the same site or at the other site.
    When the application is moved to a remote site, HACMP/XD for GLVM continues to mirror the data.
    See the HACMP/XD for GLVM Planning and Administration Guide for details.

    HAGEO for AIX 5L Overview

    High Availability Geographic Cluster for AIX 5L (HAGEO) is a component of HACMP/XD. HAGEO extends an HACMP cluster to encompass two physically separate data centers. Data entered at one site is sent across a point-to-point TCP/IP network and mirrored at a second, geographically distant location.

    For example, data resulting from a bank transaction in New York City can be mirrored in Washington, DC. Each site can be a backup data center for the other, maintaining an updated copy of essential data and running key applications. If a disaster disables one site, the data is available within minutes at the other site. This capability of the HAGEO software is called geographic mirroring (geo-mirroring).

    HAGEO is a logical extension to the HACMP software. HACMP ensures that the computing environment within a site remains highly available. The HAGEO software ensures that the critical data remains highly available even if an entire site fails or is destroyed by a disaster.

    Data entered on nodes at one site is written locally and mirrored to nodes at the other site. If a site fails, HAGEO automatically notifies the system administrator and makes the geo-mirrored data available at the remote site.

    When the failed site has recovered, starting HAGEO on a node at the reintegrating site automatically reintegrates the site into the cluster. The geo-mirrored data is synchronized between the sites during the reintegration.

    The HAGEO cluster thus increases the level of availability provided by the HACMP software by enabling it to recognize and handle a site failure, to continue processing even though one of the sites has failed, and to reintegrate the failed site back into the cluster.

    The HACMP software provides site event scripts to customize how HACMP responds when a site goes up or down. For information about site event scripts, see Chapter 7: Planning for Cluster Events.

    For complete information about the HAGEO product and how it is integrated with HACMP, see the HAGEO documentation at the following URL:

    http://www.ibm.com/servers/eserver/pseries/library/hacmp_hiavgeo.html

    HACMP/XD for Metro Mirror Overview

    IBM HACMP/XD for Metro Mirror increases data availability for IBM TotalStorage Enterprise Storage Server (ESS) volumes that use Peer-to-Peer Remote Copy (PPRC) to copy data to a remote site for disaster recovery purposes. HACMP/XD for Metro Mirror takes advantage of the PPRC fallover/fallback functions and HACMP cluster management to reduce downtime and recovery time during disaster recovery.

    HACMP/XD Metro Mirror for SVC provides a fully automated, highly available disaster recovery management solution that takes advantage of the SAN Volume Controller’s ability to provide virtual disks derived from varied disk subsystems. The HACMP interface is designed so that once the basic SVC environment is configured, PPRC relationships are created automatically; no additional access to the SVC interface is needed.

    See the HACMP/XD: PPRC Planning and Administration Guide for complete information.

    Cross-Site LVM Overview

    Cross-site LVM mirroring replicates data between the disk subsystem at each site for disaster recovery. You can set up disks located at two different sites for remote mirroring.

    A storage area network (SAN) is a high-speed network that allows the establishment of direct connections between storage devices and processors (servers) within the distance supported by Fibre Channel. Thus, two or more servers (nodes) located at different sites can access the same physical disks, which can be separated by some distance as well, through the common SAN.

    These remote disks can be combined into a volume group via the AIX 5L Logical Volume Manager and this volume group can be imported to the nodes located at different sites. The logical volumes in this volume group can have up to three mirrors. Thus you can set up a mirror at each site. The information stored on this logical volume is kept highly available, and in case of certain failures, the remote mirror at another site will still have the latest information, so the operations can be continued on the other site.
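
    As an illustration only (the disk, volume group, and logical volume names below are hypothetical, and C-SPOC is the recommended way to make shared LVM changes in a cluster), adding a remote mirror copy with standard AIX 5L LVM commands might look like this:

    # hdisk2 is at the local site, hdisk3 at the remote site
    extendvg datavg hdisk3       # bring the remote disk into the volume group
    mklvcopy datalv 2 hdisk3     # add a second copy of datalv on the remote disk
    syncvg -l datalv             # synchronize the new mirror copy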

    HACMP automatically synchronizes mirrors after a disk or node failure and subsequent reintegration. HACMP handles the automatic mirror synchronization even if one of the disks is in the PVREMOVED or PVMISSING state. Automatic synchronization is not possible for all cases, but you can use C-SPOC to synchronize the data from the surviving mirrors to stale mirrors after a disk or site failure and subsequent reintegration.

    Completing the Cluster Site Worksheet

    You use HACMP sites if you use cross-site LVM mirroring or any of the HACMP/XD components. Use the Cluster Site Worksheet to plan your HACMP sites. Note that related site information appears on the Resource Group Worksheet.

    To complete the Cluster Site Worksheet:

      1. Record the Site Name. Use no more than 32 alphanumeric characters and underscores.
      2. Record the names of the cluster nodes that belong to the site in the Cluster Nodes in Site list. The nodes must have the same names as those you define to the HACMP cluster. A site can have 1–7 nodes. A node can belong to only one site.
    With sites configured for HAGEO, you can have up to eight nodes configured.
    With sites configured for HACMP/XD for GLVM or HACMP/XD for Metro Mirror, you can have as many nodes in a cluster as HACMP supports.
      3. Decide on site dominance and record it in the Site Dominance field. (Consult the documentation for the HACMP/XD component you plan to use.)
      4. Select the site backup communication method and record it in the Site Backup Communication Method field. (Consult the documentation for the HACMP/XD component you plan to use.)
      5. Record the Inter-site Management Policy on the Resource Group Worksheet. The default is Ignore. You can also select Online on Either Site, Online on Both Sites or Prefer Primary Site. For complete information on inter-site management policies and how they work with other resource group attributes and runtime policies, see Chapter 6: Planning Resource Groups.

    Application Planning

    Before you start planning for an application, be sure you understand the data resources for your application and the location of these resources within the cluster, so that they can be handled correctly if a node fails. To handle a failure correctly, you must thoroughly understand how the application behaves in a single-node and in a multi-node environment. Do not make assumptions about the application's performance under adverse conditions.

    Use nodes with sufficient CPU cycles and I/O bandwidth to allow the production application to run at peak load. Remember, nodes must also have enough capacity left over to allow HACMP itself to operate.

    To plan for this, benchmark or model your production application and list the parameters of the heaviest expected loads. Then choose nodes for an HACMP cluster that will not exceed 85% busy when running your production application.

    We recommend that you configure multiple application monitors for an application and direct HACMP to both:

  • Monitor the termination of a process or more subtle problems affecting an application
  • Automatically attempt to restart the application and take appropriate action (notification or fallover) if restart attempts fail.
    This section explains how to record all the key information about your application or applications on an Application Worksheet, provided in Appendix A: Planning Worksheets, and how to begin drawing your cluster diagram.

    Keep in mind the following guidelines to ensure that your applications are serviced correctly within an HACMP cluster environment:

  • Lay out the application and its data so that only the data resides on shared external disks. This arrangement not only prevents software license violations, but it also simplifies failure recovery.
  • If you are planning to include multi-tiered applications in parent/child dependent resource groups in your cluster, see the section Planning Considerations for Multi-Tiered Applications. If you are planning to use location dependencies to keep certain applications on the same node, or on different nodes, see the section Resource Group Dependencies in Chapter 6: Planning Resource Groups.
  • Write robust scripts to both start and stop the application on the cluster nodes. The startup script especially must be able to recover the application from an abnormal termination, such as a power failure. Ensure that each script runs properly in a single-node environment before introducing the HACMP software. Be sure to include the start and stop resources on both the Application Worksheet and the Application Server Worksheet in Appendix A: Planning Worksheets. A minimal stop-script sketch follows these guidelines.
  • Confirm application licensing requirements. Some vendors require a unique license for each processor that runs an application, which means that you must license-protect the application by incorporating processor-specific information into the application when it is installed. As a result, even though the HACMP software processes a node failure correctly, it may be unable to restart the application on the fallover node because of a restriction on the number of licenses for that application available within the cluster. To avoid this problem, be sure that you have a license for each system unit in the cluster that may potentially run an application.
  • Ensure that the application runs successfully in a single-node environment. Debugging an application in a cluster is more difficult than debugging it on a single processor.
  • Verify that the application uses a proprietary locking mechanism if you need concurrent access.

    For more information about what types of applications work best under HACMP and for strategies that can help keep your applications highly available, see Appendix B: Applications and HACMP.
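
    The following sketch shows the shape of a stop script that tolerates the application already being down; the custdata_daemon process name is an assumption used for illustration, so adapt the check to your own application:

    #!/bin/ksh
    # Hypothetical stop script, for example /usr/stop_custdata.
    # It must succeed even if the application is not running, because HACMP
    # may also call it as a cleanup method after a failure is detected.
    PIDS=$(ps -ef | grep "[c]ustdata_daemon" | awk '{ print $2 }')
    if [ -n "$PIDS" ]
    then
        kill $PIDS                  # ask the server to shut down
        sleep 10
        kill -9 $PIDS 2>/dev/null   # force any process that is still left
    fi
    exit 0                          # always report success to the event scripts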

    Planning for Capacity Upgrade on Demand

    Capacity Upgrade on Demand (CUoD) is one of the facilities of DLPAR (Dynamic Logical Partitioning) on some pSeries IBM servers that lets you activate preinstalled but inactive (and not yet paid for) processors as resource requirements change. The additional CPUs and memory, while physically present, are not used until you decide that the additional capacity you need is worth the cost. This provides you with a fast and easy upgrade in capacity to meet peak or unexpected loads.

    HACMP 5.2 and up integrates with the Dynamic Logical Partitioning (DLPAR) and CUoD functions. You can configure cluster resources so that a logical partition with minimal allocated resources serves as a standby node, while the application resides on another LPAR node that has more resources than the standby node.

    When it is necessary to run the application on the standby node, HACMP ensures that the node has sufficient resources to successfully run the application, and allocates the necessary resources. The resources can be allocated from two sources:

  • The free pool. The DLPAR function provides the resources to the standby node, by allocating the resources available in the free pool on the frame.
  • CUoD provisioned resources. If there are not enough available resources in the free pool that can be allocated through DLPAR, the CUoD function provides additional resources to the standby node, should the application require more memory or CPU.

    When you configure HACMP to use resources through DLPAR and CUoD, the LPAR nodes in the cluster do not use any additional resources until the resources are required by the application.
    HACMP ensures that each node can support the application with reasonable performance at a minimum cost. This way, you can upgrade the capacity of the logical partition in cases when your application requires more resources, without having to pay for idle capacity until you actually need it.
    For information on how to plan and configure CUoD in HACMP, see the Administration Guide.

    Application Servers

    To put the application under HACMP control, you create an application server resource that associates a user-defined name with the names of specially written scripts to start and stop the application. Once you define an application server, HACMP can start another instance of the application on the takeover node when a fallover occurs, so that the application does not become a single point of failure. An application server can also be monitored using the application monitoring facility or the Application Availability Analysis tool.

    After you define the application server, you can add it to a resource group. A resource group is a set of resources that you define so that the HACMP software can treat them as a single unit. For information about planning to add your application servers and other resources to resource groups, see Chapter 6: Planning Resource Groups.

    Applications Integrated with HACMP

    Certain applications, including Fast Connect Services and Workload Manager, can be configured directly as highly available resources, without application servers or additional scripts. In addition, HACMP cluster verification ensures the correctness and consistency of certain aspects of your Fast Connect Services or Workload Manager configuration.

    Later sections in this chapter describe how these applications work in HACMP and how to plan for configuring them.

    HACMP Smart Assist Programs

    HACMP 5.4 offers three HACMP Smart Assist applications to help you easily integrate these applications into an HACMP cluster:

  • Smart Assist for WebSphere. Extends an existing HACMP configuration to include monitoring and recovery support for various WebSphere components.
  • Smart Assist for DB2. Extends an existing HACMP configuration to include monitoring and recovery support for DB2 Universal Database (UDB) Enterprise Server Edition.
  • Smart Assist for Oracle. Provides assistance with installing the Oracle® Application Server 10g (9.0.4) (AS10g) Cold Failover Cluster (CFC) solution on the IBM AIX 5L™ (5200) operating system.

    For more information about the HACMP Smart Assists, see the section Accessing Publications in About This Guide.

    Application Monitoring

    HACMP can monitor applications that are defined to application servers in one of two ways:

  • Process monitoring detects the termination of a process, using RSCT Resource Monitoring and Control (RMC) capability.
  • Custom monitoring monitors the health of an application, using a monitor method that you define.

    You can configure multiple application monitors and associate them with one or more application servers. You can assign each monitor a unique name in SMIT. By supporting multiple monitors per application, HACMP can support more complex configurations. For example, you can configure one monitor for each instance of an Oracle parallel server in use. Alternatively, you can configure a custom monitor to check the health of the database along with a process termination monitor to instantly detect termination of the database process.

    For information about how to define application monitors using the SMIT interface, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.

    You can use the Application Availability Analysis tool to measure the exact amount of time that any of your HACMP-defined applications is available. The HACMP software collects, time stamps, and logs the following information:

  • An application monitor is defined, changed, or removed
  • An application starts, stops, or fails
  • A node fails or is shut down, or comes up
  • A resource group is taken offline or moved
  • Application monitoring via multiple monitors is suspended or resumed.

    For more information about this tool, see the section Measuring Application Availability in Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.

    Planning Considerations for Multi-Tiered Applications

    Business configurations that use multi-tiered applications can utilize parent/child dependent resource groups. For example, the database must be online before the application server is started. In this case, if the database goes down and is moved to a different node, the resource group containing the application server must also be brought down and back up on any node in the cluster.

    Environments such as SAP require applications to be cycled (stopped and then started again) whenever a database fails. There are many application services provided by an environment like SAP, and the individual application components often need to be controlled in a specific order.

    Establishing interdependencies between resource groups is also useful when system services are required to support application environments. Services such as cron jobs for pruning log files or for initiating backups need to move from one node to another along with an application, but typically are not initiated until the application is established. These services can be built into application server start and stop scripts, or they can be controlled through pre- and post- event processing. However, dependent resource groups simplify the way you configure system services to be dependent upon applications they serve.

    Note: To minimize the chance of data loss during the application stop and restart process, customize your application server scripts to ensure that any uncommitted data is stored to a shared disk temporarily during the application stop process and read back to the application during the application restart process. It is important to use a shared disk as the application may be restarted on a node other than the one on which it was stopped.

    You can also configure resource groups with location dependencies so that certain resource groups are kept ONLINE on the same node, at the same site, or on different nodes at startup, fallover, and fallback. For information on location dependencies between resource groups, see Chapter 6: Planning Resource Groups.

    Planning Applications and Application Servers

    This section contains the following topics:

  • Completing the Application Worksheet
  • Completing the Application Server Worksheet
  • Completing the Application Monitoring Worksheet (Process Monitor or Custom Monitor).

    Completing the Application Worksheet

    Print the Application Worksheet from Appendix A: Planning Worksheets, and fill it out using the information in this section. Print one copy for each application you want to keep highly available in the cluster.

    To complete the Application Worksheet:

      1. Assign a name to the application and record it in the Application Name field.
      2. Enter information describing the application’s executable and configuration files under the Directory/Path, Filesystem, Location, and Sharing columns. Enter the full path name of each file. You can store the filesystem for either the executable or the configuration files on an internal or an external disk device. Different situations may require you to do it one way or the other. Be aware that if you store the filesystem on an internal disk, it will not be accessible to other nodes during a resource takeover.
      3. Enter information describing the application’s data and log files under the appropriate columns listed in Step 2. Data and log files can be stored in a filesystem (or on a logical device) and must be stored externally if a resource takeover is to occur successfully.
      4. Enter in the Normal Start Command/Procedures field the names of the start scripts you created to start the application after a resource takeover.
      5. Enter in the Verification Commands/Procedures field the names of commands or procedures to use to ensure that the normal start scripts ran successfully.

    Completing the Application Server Worksheet

    Print the Application Server Worksheet from Appendix A: Planning Worksheets, and fill it out using the information in this section. Print one copy for each application in the cluster.

    To complete the Application Server Worksheet:

      1. Enter the cluster name in the Cluster Name field.
    You determined this value while completing the worksheet.
    For each application server you define, fill in the following fields:
  • Assign and record a name for the application in the Application field.
  • Assign a symbolic name that identifies the server and record it in the Server Name field. For example, you could name the application server for the customer database application custdata. The name can include up to 64 characters and can contain only alphanumeric and underscore (_) characters.
  • Record the full pathname of the user-defined script that starts the application server in the Start Script field. (Maximum 256 characters.) This information was recorded in the Application Worksheet. Be sure to include the script’s arguments, if necessary. The script is called by the cluster event scripts. For example, you could name the start script and specify its arguments for starting the custdata application server as follows:
    /usr/start_custdata -d mydir -a jim_svc

    where the -d option specifies the name of the directory for storing images, and the -a option specifies the service IP address (label) for the server running the demo. A sketch of such a start script follows this list.

  • Record the full pathname of the user-defined script that stops the application in the Stop Script field. (Maximum 256 characters.) This script is called by the cluster event scripts. For example, you could name the stop script for the custdata application server /usr/stop_custdata.
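
    A minimal sketch of what such a start script might contain follows; the custdata_daemon binary, the lock file path, and the option handling are assumptions for illustration only:

    #!/bin/ksh
    # Hypothetical start script, for example /usr/start_custdata -d mydir -a jim_svc
    while getopts d:a: opt
    do
        case $opt in
            d) IMAGEDIR=$OPTARG ;;   # directory for storing images
            a) SVCLABEL=$OPTARG ;;   # service IP label for the server
            *) print "usage: start_custdata -d dir -a service_label"; exit 1 ;;
        esac
    done
    # Recover from an abnormal termination (such as a power failure) by
    # clearing a stale lock file before starting the server.
    [ -f /var/run/custdata.lock ] && rm -f /var/run/custdata.lock
    /usr/bin/custdata_daemon -d "$IMAGEDIR" -a "$SVCLABEL" &
    exit 0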

    Completing the Application Monitoring Worksheet

    This section describes how to complete the following worksheets:

  • Completing the Application Monitor (Process) Worksheet
  • Completing the Application Monitor (Custom) Worksheet.

    When you use an application monitor with HACMP, HACMP restarts the application; the System Resource Controller (SRC) should not restart it.

    If a monitored application is controlled by the SRC, ensure that its action and multi settings are as follows:

    -O
    Specifies that the subsystem is not restarted if it stops abnormally.
    -Q
    Specifies that multiple instances of the subsystem are not allowed to run at the same time.

    To review these settings, use the following command:

    lssrc -Ss Subsystem | cut -d : -f 10,11

    If the values are not -O and -Q, use the chssys command to change them.
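
    For example, assuming a subsystem named mydemo (the name is illustrative):

    # Display the action and multi settings (fields 10 and 11)
    lssrc -Ss mydemo | cut -d : -f 10,11
    # Set the subsystem so that the SRC does not restart it and does not
    # allow multiple instances
    chssys -s mydemo -O -Q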

    Completing the Application Monitor (Process) Worksheet

    Print the Application Monitor Worksheet (Process Monitor) from Appendix A: Planning Worksheets, and fill it out using the information in this section. Print as many copies as you need. You can configure multiple application monitors and associate them with one or more application servers. You can assign each monitor a unique name in SMIT.

    To complete the Application Monitor Worksheet (Process Monitor):

      1. Enter the cluster name in the Cluster Name field.
      2. Specify the Application Server Name for which you are configuring a process monitor.
      3. Determine whether this application can be monitored with a process monitor.
    For example, shell scripts cannot be monitored. For more information about application monitoring, see the section Configuring Application Monitoring in Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.
    If the application can be monitored, proceed to step 4.
    If the application cannot be monitored with a process monitor, proceed to the instructions for the Application Monitor Worksheet (Custom Monitor).
      4. Indicate the name(s) of one or more processes to be monitored. Be careful when listing process names. It is very important that the names are correct when you enter them in SMIT to configure the application monitor. For information about identifying process names, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.
      5. Specify the user ID of the owner of the processes specified in step 4 (for example, root). Note that the process owner must own all processes to be monitored.
      6. Specify how many instances of the application to monitor. The default is 1 (one) instance. This number must be 1 (one) if you have specified more than one process to monitor.
      7. Specify the time (in seconds) to wait before beginning monitoring. For instance, with a database application, you may wish to delay monitoring until after the start script and initial database search have been completed. You may need to experiment with this value to balance performance with reliability. In most circumstances, this value should not be zero.
    Note: Allow at least 2–3 seconds before beginning application monitoring, to ensure HACMP and user applications quiesce.
      8. Specify the Restart Count, denoting the number of times to attempt to restart the application before taking any other actions. The default is 3. Make sure you enter a Restart Method (see step 13) if your Restart Count is any non-zero value.
      9. Specify the interval (in seconds) that the application must remain stable before resetting the Restart Count. This interval becomes important if a number of failures occur over a period of time. Resetting the count to zero at the proper time prevents a later failure from being counted as the last failure from the previous problem, in cases when it should be counted as the first of a new problem.
    Do not set this interval to be shorter than (Restart Count) x (Stabilization Interval). The default is 10 percent longer than that value. If it is too short, the count will be reset to zero repeatedly, and the specified failure action will never occur. A brief worked example follows these steps.
      10. Specify the action to be taken if the application cannot be restarted within the Restart Count. The default choice is notify, which runs an event to inform the cluster of the failure. You can also specify fallover, in which case the resource group containing the failed application moves over to the cluster node with the next-highest priority for that resource group.
    Note: Keep in mind that if you choose the fallover option of application monitoring, which may cause resource groups to migrate from their original owner node, the possibility exists that while the highest priority node is up, the resource group remains down. This situation occurs when an rg_move event moves a resource group from its highest priority node to a lower priority node, and then you stop cluster services on the lower priority node and bring the resource groups offline.
    Unless you bring up the resource group manually, it will remain in an inactive state. For more information, see the section on Application Monitoring Prerequisites and Considerations in Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.
      11. (Optional) Define a notify method that will run when the application fails. This user-defined method, typically a shell script, runs during the restart process and during notify activity.
      12. (Optional) Specify an application cleanup script to be called when a failed application is detected, before calling the restart method. The default is the application server stop script you define when you set up the application server.
    Note: Since the application is already stopped when this script is called, the server stop script may fail. For more information on writing correct stop scripts, see Appendix B: Applications and HACMP.
      13. (Required if Restart Count is not zero.) The default restart method is the application server start script you define when the application server is set up. Specify a different restart method if desired.
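
    As a brief worked example of the reset interval described in step 9 (the numbers are purely illustrative):

    # Restart Count of 3 and a 60-second stabilization interval
    restart_count=3
    stabilization=60
    minimum=$(( restart_count * stabilization ))   # 180 seconds
    default=$(( minimum + minimum / 10 ))          # 198 seconds (10 percent longer)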

    Completing the Application Monitor (Custom) Worksheet

    Print the Application Monitor Worksheet (Custom Monitor) from Appendix A: Planning Worksheets, and fill it out using the information in this section. If you plan to set up a custom monitor method, complete this worksheet for each user-defined application monitor you plan to configure. You can configure multiple application monitors and associate them with one or more application servers. You can assign each monitor a unique name in SMIT.

    To complete the Application Monitor Worksheet (Custom Monitor):

      1. Enter the cluster name in the Cluster Name field.
      2. Fill in the name of the application server.
      3. Specify a script or executable for custom monitoring of the health of the specified application. Do not leave this field blank when you configure the monitor in SMIT. The monitor method must return a zero value if the application is healthy, and a non-zero value if a problem is detected. See the section Notes on Defining a Custom Monitoring Method later in this chapter.
      4. Specify the polling interval (in seconds) for how often the monitor method is to be run. If the monitor does not respond within this interval, it is considered “hung.”
      5. Specify a signal to kill the user-defined monitor method if it does not return within the monitor interval. The default signal is kill -9.
      6. Specify the time (in seconds) to wait before beginning monitoring. For instance, with a database application, you may wish to delay monitoring until after the start script and initial database search have been completed. You may need to experiment with this value to balance performance with reliability.
    Note: In most circumstances, this value should not be zero. Allow at least 2–3 seconds before beginning application monitoring, to ensure HACMP and user applications quiesce.
      7. Specify the restart count, denoting the number of times to attempt to restart the application before taking any other actions. The default is 3.
      8. Specify the interval (in seconds) that the application must remain stable before resetting the restart count. This interval becomes important if a number of failures occur over a period of time. Resetting the count to zero at the proper time keeps a later failure from being counted as the last failure from the previous problem, when it should be counted as the first of a new problem.
    Do not set this to be shorter than (Restart Count) x (Stabilization Interval + Monitor Interval). The default is 10 percent longer than that value. If it is too short, the count will be reset to zero repeatedly, and the specified failure action will never occur.
      9. Specify the action to be taken if the application cannot be restarted within the restart count. You can keep the default choice notify, which runs an event to inform the cluster of the failure, or choose fallover, in which case the resource group containing the failed application moves over to the cluster node with the next-highest priority for that resource group.
    Note: Keep in mind that if you choose the fallover option of application monitoring, which may cause resource groups to migrate from their original owner node, the possibility exists that while the highest priority node is up, the resource group remains down. This situation occurs when an rg_move event moves a resource group from its highest priority node to a lower priority node, and then you stop cluster services on the lower priority node and bring resource groups offline.
    Unless you manually bring up the resource group, it will remain in an inactive state. For more information, see the section on Application Monitoring Prerequisites and Considerations in Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.
      10. (Optional) Define a notify method that will run when the application fails. This custom method runs during the restart process and during a server_down event.
      11. (Optional) Specify an application cleanup script to be called when a failed application is detected, before calling the restart method. The default is the application server stop script defined when the application server was set up.
    Note: The application may be already stopped when this script is called, and the server stop script may fail. For more information on writing correct stop scripts, see Appendix B: Applications and HACMP.
      12. (Required if Restart Count is not zero.) The default restart method is the application server start script you define when the application server is set up. Specify a different restart method if desired.

    Notes on Defining a Custom Monitoring Method

    When defining your custom monitoring method, keep in mind the following points:

  • You can configure multiple application monitors and associate them with one or more application servers. You can assign each monitor a unique name in SMIT.
  • The monitor method must be an executable program (it can be a shell script) that tests the application and exits, returning an integer value that indicates the application’s status. The return value must be zero if the application is healthy, and must be a non-zero value if the application has failed. A minimal monitor sketch follows this list.
  • HACMP does not pass arguments to the monitor method.
  • The monitor method logs messages to the /tmp/clappmond.application_monitor_name.monitor.log file by printing messages to standard output (stdout). Each time the monitor runs, the log file is overwritten.
  • Do not make the method overly complicated. The monitor method is killed if it does not return within the specified polling interval. Test your monitor method under different workloads to arrive at the best polling interval value.
  • Ensure that the System Resource Controller (SRC) is not configured to restart the application, and take steps accordingly. For more information on which steps to take, see Planning Applications and Application Servers.
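
    A minimal monitor sketch appears below; it assumes a hypothetical custdata application that listens on TCP port 5500, so adapt the health check to whatever "healthy" means for your application:

    #!/bin/ksh
    # Hypothetical custom monitor method. HACMP passes no arguments.
    # Exit 0 if the application is healthy, non-zero if it has failed.
    PORT=5500
    if netstat -an | grep "\*\.$PORT " | grep LISTEN > /dev/null
    then
        print "custdata listener found on port $PORT"      # written to the monitor log
        exit 0
    else
        print "custdata listener not found on port $PORT"
        exit 1
    fi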

    Planning for AIX 5L Fast Connect

    Some applications, such as AIX 5L Fast Connect, do not require application servers because they are already integrated with HACMP. You do not need to write additional scripts or create an application server for these applications to be made highly available under HACMP.

    AIX 5L Fast Connect allows client PCs running Windows, DOS, and OS/2 operating systems to request file and print services from an AIX 5L server. Fast Connect supports the transport protocol NetBIOS over TCP/IP. You can use SMIT to configure AIX 5L Fast Connect resources.

    The Fast Connect application is integrated with HACMP. You can use SMIT to configure Fast Connect services as highly available resources in resource groups. HACMP can then stop and start the Fast Connect resources when fallover, recovery, and dynamic resource group migrations occur. This application does not need to be associated with application servers or special scripts.

    In addition, the HACMP cluster verification process ensures the accuracy and consistency of certain aspects of your AIX 5L Fast Connect configuration.

    Planning Considerations for Fast Connect

    To plan for configuration of Fast Connect as a cluster resource in HACMP:

  • Install the Fast Connect Server on all nodes in the cluster.
  • If Fast Connect printshares are to be highly available, ensure that the AIX 5L print queue names match for every node in the cluster.
  • For non-concurrent groups, assign the same NetBIOS name to each node when the Fast Connect Server is installed.
    This action minimizes the steps needed for the client to connect to the server after fallover.
    Note: Only one instance of a NetBIOS name can be active at one time. For that reason, do not manually activate Fast Connect servers that are under HACMP control.
  • For concurrently configured resource groups, assign different NetBIOS names across nodes.
  • In concurrent configurations, define a second, non-concurrent resource group to control any filesystem that must be available for the Fast Connect nodes.
  • Having a second resource group configured in a concurrent cluster keeps the AIX 5L filesystems used by Fast Connect cross-mountable and highly available in the event of a node failure.
  • Do not configure Fast Connect in a mutual takeover configuration.
  • A node cannot participate in two Fast Connect resource groups at the same time.

    Fast Connect as a Highly Available Resource

    When AIX 5L Fast Connect resources are configured as part of a resource group, HACMP handles them as described in the following list:

  • Fast Connect start and stop. When a Fast Connect server has resources configured in HACMP, HACMP starts and stops the server during fallover, recovery, and resource group reconfiguration or migration.
    Note: The Fast Connect server must be stopped on all nodes when bringing up the cluster. This ensures that HACMP will start the Fast Connect server and handle its resources properly.
  • Node failure. When a node that owns Fast Connect resources fails, the resources become available on the takeover node. When the failed node rejoins the cluster, the resources are again available on the original node (as long as the resource policy is such that the failed node reacquires its resources).
  • Clients do not need to reestablish a connection to access the Fast Connect Server after fallover, as long as IP and Hardware Address Takeover (HWAT) are configured and occur, and users have configured their Fast Connect server with the same NetBIOS name on all nodes (for non-concurrent resource groups).
    For switched networks and for clients not running Clinfo, you may need to take additional steps to ensure client connections after fallover. For more information about configuration considerations for clients not running Clinfo, see Chapter 8: Planning for HACMP Clients.
  • Adapter failure. When a service adapter or network interface card running the transport protocol needed by the Fast Connect server fails, HACMP performs an adapter swap as usual, and Fast Connect establishes a connection with the new adapter. After an adapter failure, clients are temporarily unable to access shared resources such as files and printers. After the adapter swap is complete, clients can again access their resources.

    Completing the Fast Connect Worksheet

    Complete the Fast Connect Worksheet to identify the resources to configure as Fast Connect resources. The worksheets are located in Appendix A: Planning Worksheets.

    To complete the Fast Connect Worksheet:

      1. Enter the cluster name in the Cluster Name field.
      2. Record the name of the resource group that will contain the Fast Connect Resources.
      3. Record the nodes participating in the resource group.
      4. Record the Fast Connect Resources to be made highly available. When you configure your resources in SMIT, you select these resources from a picklist.
      5. Record the filesystems that contain the files or directories that you want Fast Connect to share. Be sure to specify these in the Filesystems SMIT field when you configure the resource group.

    For more information about using AIX 5L Fast Connect, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.

    Planning for Highly Available Communication Links

    HACMP can provide high availability for the following types of communication links:

  • SNA configured over LAN network interface cards
  • SNA over X.25
  • Pure X.25.

    LAN interface cards include Ethernet, Token Ring, and FDDI. These cards are configured as part of the HACMP cluster topology.

    X.25 cards are usually, although not always, used for WAN connections. They provide a mechanism to connect dissimilar machines, from mainframes to dumb terminals. The typical use of X.25 networks makes these cards a different class of devices that are not included in the cluster topology and not controlled by the standard HACMP topology management methods. This means that heartbeats are not used to monitor an X.25 card’s status, and you do not define X.25-specific networks in HACMP.

    Once defined in an HACMP resource group, communication links are protected in the same way other HACMP resources are. In the event of a LAN card or X.25 link failure, or general node or network failures, a highly available communication link falls over to another available card on the same node or on a takeover node.

    For clusters where highly available communication links have been configured, verification checks whether the appropriate fileset has been installed and reports an error in the following situations:

  • If SNA HA Communication Links is configured as a part of a resource group and the sna.rte fileset version 6.1.0.0 or higher is missing on some nodes that are part of this resource group
  • If X.25 HA Communication Links is configured as a part of a resource group and the sx25.rte fileset version 2.0.0.0 or higher is missing on some nodes that are part of this resource group
  • If SNA-over-X.25 HA Communication Links are configured as part of a resource group and either the sna.rte fileset (version 6.1.0.0 or higher) or the sx25.rte fileset (version 2.0.0.0 or higher) is missing on some nodes that are part of this resource group.
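
    You can check the installed fileset levels yourself on each node, for example:

    # Verify the communication filesets and their levels
    lslpp -l sna.rte sx25.rte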

    SNA and X.25 Links Required Software and Hardware

    SNA links and X.25 links require additional configuration outside of HACMP before you can configure them as highly available communication links in HACMP. For more information about configuring the required software and hardware, see the manuals at the following URLs:

    www.ibm.com

    www.redbooks.ibm.com

    SNA links require the following:

  • CS/AIX 5L version 6.1 or higher is required.
  • SNA-over-LAN links are supported over Ethernet, Token Ring, and FDDI adapters.

    X.25 links require the following:

  • AIXlink/X.25 version 2 or higher is required.
  • X.25 links are supported on the following adapters:
  • IBM 2-Port Multiprotocol Adapter (DPMP)
  • IBM Artic960Hx PCI Adapter (Artic)
  • Completing the Communication Links Worksheets

    This section describes how to fill out the following worksheets:

  • Completing the Communication Links (SNA-Over-LAN) Worksheet
  • Completing the Communication Links (X.25) Worksheet
  • Completing the Communication Links (SNA-Over-X.25) Worksheet.

    Completing the Communication Links (SNA-Over-LAN) Worksheet

    Print the Communication Links (SNA-Over-LAN) Worksheet from Appendix A: Planning Worksheets, and fill it out using the information in this section. Complete one worksheet for each SNA-over-LAN communication link in your cluster.

    The SNA link must already be configured separately in AIX 5L before the link can be defined to HACMP. Much of the information you fill in here will be drawn from the AIX 5L configuration information.

    To complete the Communication Links (SNA-Over-LAN) Worksheet:

      1. Enter the cluster name in the Cluster Name field.
      2. Enter the resource group in which the communication link will be defined in the Resource Group field.
      3. Enter nodes participating in the resource group in the Nodes field.
      4. Enter the link name in the Name field.
      5. Enter the DLC name in the DLC Name field. This is the name of an existing DLC profile to be made highly available.
      6. Enter the names of any ports to be started automatically in the Port(s) field.
      7. Enter the names of the link stations in the Link Station(s) field.
      8. Enter the name of the Application Service File that this link uses to perform customized operations when this link is started or stopped.

    Completing the Communication Links (X.25) Worksheet

    Print the Communication Links (X.25) Worksheet from Appendix A: Planning Worksheets, and fill it out using the information in this section. Complete one worksheet for each X.25 communication link in your cluster.

    The X.25 worksheet is for configuring the link in HACMP. The X.25 adapter and link must already be configured separately in AIX 5L before the link can be defined for HACMP. Much of the information you fill in here will be drawn from the AIX 5L configuration information.

    To complete the Communication Links (X.25) Worksheet:

      1. Enter the cluster name in the Cluster Name field.
      2. Enter the resource group in which the communication link will be defined in the Resource Group field.
      3. Enter nodes participating in the resource group in the Nodes field.
      4. Enter the link name in the Name field.
      5. Enter the X.25 Port to be used for this link (for example, sx25a0). The port name must be unique across the cluster. This name must begin with “sx25a” but the final numeric character is your choice. The port name can be up to eight characters long; therefore the final numeric can contain up to three digits.
      6. In the Address/NUA field, enter the X.25 address (local NUA) that will be used by this link.
      7. For Network ID, the default value is 5. Enter a different number here if needed.
      8. Enter the X.25 Country Code, or leave it blank and the system default will be used.
      9. For Adapter Name(s), identify the communication adapters you want this link to be able to use. Enter HACMP names, not device names. In SMIT, you select an entry for this field from a picklist.
      10. Enter the name of the Application Service File that this link uses to perform customized operations when this link is started or stopped.

    Completing the Communication Links (SNA-Over-X.25) Worksheet

    Print the Communication Links (SNA-Over-X.25) Worksheet from Appendix A: Planning Worksheets, and fill it out using the information in this section. Complete one worksheet for each SNA-over-X.25 communication link in your cluster. The SNA-Over-X.25 worksheet is for configuring the link in HACMP. The SNA link and the X.25 adapter and link must already be configured separately in AIX 5L before the SNA-over-X.25 link can be defined for HACMP. Much of the information you fill in here will be drawn from the AIX 5L configuration information.

    To complete the Communication Links (SNA-Over-X.25) Worksheet:

      1. Enter the cluster name in the Cluster Name field.
      2. Enter the resource group in which the communication link will be defined in the Resource Group field.
      3. Enter nodes participating in the resource group in the Nodes field.
      4. Enter the link name in the Name field.
      5. Enter the X.25 Port to be used for this link (for example, sx25a0). The port name must be unique across the cluster. This name must begin with “sx25a” but the final numeric character is your choice. The port name can be up to eight characters long; therefore the final numeric can contain up to three digits.
      6. In the X.25 Address/NUA field, enter the X.25 address (local NUA) that will be used by this link.
      7. For X.25 Network ID, the default value will be 5. Enter a different number here if needed.
      8. Enter the X.25 Country Code, or leave it blank and the system default will be used.
      9. For X.25 Adapter Name(s), identify the communication adapters you want this link to be able to use. Enter HACMP names, not device names. In SMIT, you select an entry for this field from a picklist.
      10. Enter the SNA DLC. This is the name of an existing DLC profile to be made highly available. In SMIT, you select an entry for this field from a picklist.
      11. Enter the names of any SNA ports to be started automatically in the SNA Port(s) field.
      12. Enter the names of the SNA link stations in the SNA Link Station(s) field.
      13. Enter the name of the Application Service File that this link uses to perform customized operations when this link is started or stopped.
    For information on how to write an appropriate script for this file, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.

    Drawing a Cluster Diagram

    The cluster diagram combines the information from each step in the planning process into one drawing that shows the cluster’s function and structure.

    The following illustration shows a mixed cluster that includes a rack-mounted system and standalone systems. The diagram uses rectangular boxes to represent the slots supported by the nodes. If your cluster uses thin nodes, darken the outline of the nodes and include two nodes to a drawer. For wide nodes, use the entire drawer. For high nodes, use the equivalent of two wide nodes. Keep in mind that each thin node contains an integrated Ethernet connection.

    Begin drawing this diagram by identifying the cluster name and the applications that are being made highly available. Next, darken the outline of the nodes that will make up the cluster. Include the name of each node. While reading subsequent chapters, you will add information about networks and disk storage subsystems to the diagram.

    Initial Diagram of Two Types of HACMP Clusters 
    

    Where You Go from Here

    Next you will design the network connectivity of your cluster, described in Chapter 3: Planning Cluster Network Connectivity.

