
Appendix B: Applications and HACMP


This appendix addresses some of the key issues to consider when making your applications highly available under HACMP.

HACMP allows you to configure clusters with multi-tiered applications by establishing dependencies between resource groups containing different applications. This appendix describes resource group dependencies and how they can help with keeping dependent applications highly available.

For details on the planning for and configuring of applications, see Chapter 2: Initial Cluster Planning and the Administration Guide.

For general information about keeping a cluster running on a 7x24 basis, see Appendix A: 7x24 Maintenance in the Administration Guide.

Overview

Besides understanding the hardware and software needed to make a cluster highly available, you will need to spend some time on application considerations when planning your HACMP environment. The goal of clustering is to keep your important applications available despite any single point of failure. To achieve this goal, it is important to consider the aspects of an application that make it recoverable under HACMP.

There are few hard and fast requirements that an application must meet to recover well under HACMP. For the most part, there are simply good practices that can head off potential problems. Some required characteristics, as well as a number of suggestions, are discussed here. These are grouped according to key points that should be addressed in all HACMP environments. This appendix covers the following application considerations:

  • Automation—making sure your applications start and stop without user intervention
  • Dependencies—knowing what factors outside HACMP affect the applications
  • Interference—knowing that applications themselves can hinder HACMP functioning
  • Robustness—choosing strong, stable applications
  • Implementation—using appropriate scripts, file locations, and cron schedules.

    You should add an application monitor to detect a problem with application startup. An application monitor in startup monitoring mode checks an application server’s successful startup within the specified stabilization interval and exits after the stabilization period expires.

    In HACMP 5.4, you can start the HACMP cluster services on the node(s) without stopping your applications by selecting an option from the SMIT panel Manage HACMP Services > Start Cluster Services. When starting, HACMP relies on the application startup scripts and configured application monitors to ensure that HACMP knows about the running application and does not start a second instance of the application.

    For more information on configuring application monitoring and steps needed to start cluster services without stopping the applications, see the Administration Guide.

    Similarly, you can stop HACMP cluster services and leave the applications running on the nodes. When the node that has been stopped and placed in an unmanaged state rejoins the cluster, the state of the resources is assumed to be the same unless a user initiates an HACMP resource group command to bring the resource group into another state (for example, online on an active node).

    At the end of this appendix, you will find two examples of popular applications—Oracle Database™ and SAP R/3™—and some issues to consider when implementing these applications in an HACMP environment.

    Application Automation: Minimizing Manual Intervention

    One key requirement for an application to function successfully under HACMP is that the application be able to start and stop without any manual intervention.

    Application Start Scripts

    Create a start script that starts the application. The start script should perform any “clean-up” or “preparation” necessary to ensure proper startup of the application, and it should properly manage the number of instances of the application that need to be started. When the application server is added to a resource group, HACMP calls this script to bring the application online as part of processing the resource group. Since the cluster daemons call the start script, there is no option for interaction. Additionally, upon an HACMP fallover, the recovery process calls this script to bring the application online on a standby node. This allows for a fully automated recovery, which is why any necessary cleanup and preparation should be included in this script.

    HACMP calls the start script as the “root” user. It may be necessary to change to a different user in order to start the application. The su command can accomplish this. Also, it may be necessary to run nohup on commands that are started in the background and have the potential to be terminated upon exit of the shell.

    For example, an HACMP cluster node may be a client in a Network Information Service (NIS) environment. If this is the case and you need to use the su command to change user id, there must be a route to the NIS master at all times. In the event that a route doesn’t exist and the su is attempted, the application script hangs. You can avoid this by enabling the HACMP cluster node to be an NIS slave. That way, a cluster node has the ability to access its own NIS map files to validate a user ID.

    The start script should also check for the presence of required resources or processes. This will ensure an application can start successfully. If the necessary resources are not available, a message can be sent to the administration team to correct this and restart the application.

    Start scripts should be written so that they check if one instance of the application is already running and do not start another instance unless multiple instances are desired. Keep in mind that the start script may be run after a primary node has failed. There may be recovery actions necessary on the backup node in order to restart an application. This is common in database applications. Again, the recovery must be able to run without any interaction from administrators.
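
    As an illustration, a minimal start script might look like the following sketch. The user name, paths, and process name are examples only and are not part of HACMP; adapt them to your application.

    #!/bin/ksh
    # Sample HACMP application start script (sketch only).
    APPUSER=appadmin
    APPCMD=/apps/myapp/bin/appserver

    # Do not start a second instance if the application is already running.
    # The [a] in the pattern keeps grep from matching itself.
    if ps -ef | grep "[a]ppserver" > /dev/null 2>&1 ; then
        exit 0
    fi

    # Clean up anything left over from a previous failure, for example
    # stale lock files on the shared filesystem.
    rm -f /apps/myapp/locks/*.lck

    # Start the application under its own user ID; nohup protects the
    # background process from being terminated when the calling shell exits.
    su - "$APPUSER" -c "nohup $APPCMD > /tmp/appserver.start.log 2>&1 &"

    exit 0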

    Also see the notes in the section Writing Effective Scripts.

    Application Stop Scripts

    The most important aspect of an application stop script is that it completely stop an application. Failure to do so may prevent HACMP from successfully completing a takeover of resources by the backup nodes. In stopping, the script may need to address some of the same concerns the start script addresses, such as NIS and the su command.

    The application stop script should use a phased approach. The first phase should attempt to stop the application gracefully. If processes refuse to terminate, the second phase should forcefully ensure that all application processing is stopped. Finally, a third phase can use a loop to repeat any steps necessary to ensure that the application has terminated completely.

    Be sure that your application stop script exits with the value 0 (zero) when the application has been successfully stopped. In particular, examine what happens if you run your stop script when the application is already stopped. Your script must exit with zero in this case as well. If your stop script exits with a different value, this tells HACMP that the application is still running, although possibly in a damaged state. The event_error event will be run and the cluster will enter an error state. This check alerts administrators that the cluster is not functioning properly.
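
    A stop script following the phased approach described above might look like the following sketch. The user name, process name, commands, and timings are examples only.

    #!/bin/ksh
    # Sample HACMP application stop script (sketch only).

    # Phase 1: attempt a graceful shutdown and allow it time to complete.
    su - appadmin -c "/apps/myapp/bin/appserver stop"
    sleep 30

    # Phase 2: forcefully terminate any application processes that remain.
    PIDS=$(ps -ef | grep "[a]ppserver" | awk '{print $2}')
    if [ -n "$PIDS" ] ; then
        kill -9 $PIDS
    fi

    # Phase 3: loop until every application process has exited.
    while ps -ef | grep "[a]ppserver" > /dev/null 2>&1
    do
        sleep 5
    done

    # Exit 0 even if the application was already stopped; a non-zero exit
    # would tell HACMP that the stop failed and trigger event_error.
    exit 0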

    Keep in mind that HACMP allows 360 seconds by default for events to complete processing. A message indicating the cluster has been in reconfiguration too long appears until the cluster completes its reconfiguration and returns to a stable state. This warning may be an indication that a script is hung and requires manual intervention. If this is a possibility, you may wish to consider stopping an application manually before stopping HACMP.

    You can change the time period before the config_too_long event is called. For more information about how to change this setting, see the section Tuning Event Duration Time Until Warning in Chapter 6: Configuring Cluster Events in the Administration Guide.

    Application Start and Stop Scripts and Dependent Resource Groups

    In HACMP 5.2 and up, support for dependent resource groups lets you configure the following:

  • Three levels of dependencies between resource groups; for example, a configuration in which resource group A depends on resource group B, and resource group B depends on resource group C. HACMP prevents you from configuring circular dependencies.
  • A type of dependency in which a parent resource group must be online on any node in the cluster before a child (dependent) resource group can be activated on a node.

    Note: If two applications must run on the same node, both applications must reside in the same resource group.

    If a child resource group contains an application that depends on resources in the parent resource group, then upon fallover conditions, if the parent resource group falls over to another node, the child resource group is temporarily stopped and automatically restarted. Similarly, if the child resource group is concurrent, HACMP takes it offline temporarily on all nodes, and brings it back online on all available nodes. If the fallover of the parent resource group is not successful, both the parent and the child resource groups go into an ERROR state.

    Note that when the child resource group is temporarily stopped and restarted, the application that belongs to it is also stopped and restarted. Therefore, to minimize the chance of data loss during the application stop and restart process, customize your application server scripts to ensure that any uncommitted data is stored to a shared disk temporarily during the application stop process and read back to the application during the application restart process. It is important to use a shared disk as the application may be restarted on a node other than the one on which it was stopped.

    Application Tier Issues

    Often, applications have a multi-tiered architecture (for example, a database tier, an application tier, and a client tier). Consider all tiers of an architecture if one or more is made highly available through the use of HACMP.

    For example, if the database is made highly available and a fallover occurs, consider whether actions should be taken at the higher tiers in order to automatically return the application to service. If so, it may be necessary to stop and restart application or client tiers. This can be facilitated in one of two ways: run clinfo on the tiers, or use a remote execution command such as rsh, rexec, or ssh.

    Note: Certain methods, such as the use of ~/.rhosts files, pose a security risk.

    Using Dependent Resource Groups

    To configure complex clusters with multi-tiered applications, you can use parent/child dependent resource groups. You may also want to consider using location dependencies. For more information, see the section Application Dependencies in this chapter, Chapter 2: Initial Cluster Planning, and Chapter 6: Planning Resource Groups.

    Using the Clinfo API

    clinfo is the cluster information daemon. You can write a program using the Clinfo API to run on any tiers that would stop and restart an application after a fallover has completed successfully. In this sense, the tier, or application, becomes “cluster aware,” responding to events that take place in the cluster. See the manual Programming Client Applications for more detail on the Clinfo API.

    Using Pre- and Post-Event Scripts

    Another way to address the issue of multi-tiered architectures is to use pre- and post-event scripts around a cluster event. These scripts would call a remote execution command such as rsh, rexec, or ssh to stop and restart the application.
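
    For example, a post-event script for the event that brings the database resource group back online could restart the application tier remotely. In the following sketch, the node names and restart commands are placeholders, and ssh is assumed to be configured between the nodes.

    #!/bin/ksh
    # Sample post-event script fragment (sketch only) that restarts the
    # application tier after the database resource group is back online.
    APP_NODES="appnode1 appnode2"

    for NODE in $APP_NODES
    do
        # ssh is used here; rsh or rexec would also work, but methods based
        # on ~/.rhosts files pose a security risk.
        ssh "$NODE" "/apps/myapp/bin/app_tier stop && /apps/myapp/bin/app_tier start"
    done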

    Application Dependencies

    Prior to HACMP 5.2, to achieve resource group and application sequencing, system administrators had to build the application recovery logic in their pre- and post-event processing scripts. Every cluster would be configured with a pre-event script for all cluster events, and a post-event script for all cluster events.

    Such scripts could become all-encompassing case statements. For example, if you want to take an action for a specific event on a specific node, you need to edit that individual case, add the required code for pre- and post-event scripts, and also ensure that the scripts are the same across all nodes.

    To summarize, even though the logic of such scripts captures the desired behavior of the cluster, they can be difficult to customize and even more difficult to maintain later on, when the cluster configuration changes.

    If you are using pre- and post-event scripts or other methods, such as resource group processing ordering, to establish dependencies between applications that are supported by your cluster, then these methods may no longer be needed or can be significantly simplified. Instead, you can specify dependencies between resource groups in a cluster. This is especially true for HACMP 5.3, with improvements to default parallel processing and the addition of location dependencies.

    For an overview and figures representing both parent/child and location dependencies, see the Concepts and Facilities Guide.

    For planning multi-tiered applications, see Planning Considerations for Multi-Tiered Applications in Chapter 2: Initial Cluster Planning.

    For information on how to configure resource group dependencies, see the Administration Guide.

    Note: In many cases, applications depend on more than data and an IP address. For the success of any application under HACMP, it is important to know what the application should not depend upon in order to function properly. This section outlines many of the major dependency issues. Keep in mind that these dependencies may come from outside the HACMP and application environment. They may be incompatible products or external resource conflicts. Look beyond the application itself to potential problems within the enterprise.

    Locally Attached Devices

    Locally attached devices can pose a clear dependency problem. In the event of a fallover, if these devices are not attached and accessible to the standby node, an application may fail to run properly. These may include a CD-ROM device, a tape device, or an optical juke box. Consider whether your application depends on any of these and if they can be shared between cluster nodes.

    Hard Coding

    Hard coding an application to a particular device in a particular location creates a potential dependency issue. For example, the console is typically assigned as /dev/tty0. Although this is common, it is by no means guaranteed. If your application assumes this, ensure that all possible standby nodes have the same configuration.

    Hostname Dependencies

    Some applications are written to be dependent on the AIX 5L hostname. They issue the hostname command in order to validate licenses or to name filesystems. The hostname is not an IP address label. The hostname is specific to a node and is not failed over by HACMP. It is possible to manipulate the hostname, or use hostname aliases, in order to trick your application, but this can become cumbersome when other applications, not controlled by HACMP, also depend on the hostname.

    Software Licensing

    Another possible problem is software licensing. Software can be licensed to a particular CPU ID. If this is the case with your application, be aware that after a fallover the software will not restart successfully on a node with a different CPU ID. You may be able to avoid this problem by having a copy of the software resident on all cluster nodes. Know whether your application uses software that is licensed to a particular CPU ID.

    Application Interference

    Sometimes an application or an application environment may interfere with the proper functioning of HACMP. An application may execute properly on both the primary and standby nodes. However, when HACMP is started, a conflict with the application or environment could arise that prevents HACMP from functioning successfully.

    Software Using IPX/SPX Protocol

    A conflict may arise between HACMP and any software that binds a socket over a network interface. An example is the IPX/SPX protocol. When active, it binds an interface and prevents HACMP from properly managing the interface. Specifically, for ethernet and token ring, it inhibits the Hardware Address Takeover from completing successfully. A “device busy” message appears in the HACMP logs. The software using IPX/SPX must be either completely stopped or not used in order for Hardware Address Takeover to work.

    Products Manipulating Network Routes

    Additionally, products that manipulate network routes can keep HACMP from functioning as it was designed. These products can find a secondary path through a network that has had an initial failure. This may prevent HACMP from properly diagnosing a failure and taking appropriate recovery actions.

    AIX 5L Fast Connect

    You can reduce the problem of conflict with certain protocols, and the need for manual intervention, if you are using AIX 5L Fast Connect to share resources. The protocols handled by this application can easily be made highly available because of their integration with HACMP.

    AIX 5L Fast Connect software is integrated with HACMP so that it can be configured as a highly available resource. AIX 5L Fast Connect allows you to share resources between AIX 5L workstations and PCs running the Windows, DOS, and OS/2 operating systems. Fast Connect supports the NetBIOS protocol over TCP/IP.

    For more information about configuring the application for HACMP, see the following resources:

  • Chapter 2: Initial Cluster Planning
  • The section Configuring AIX 5L Fast Connect in Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide
  • The appropriate planning worksheets in Appendix A: Planning Worksheets.

    Robustness of Application

    Of primary importance to the success of any application is the health, or robustness, of the application. If the application is unstable or crashing intermittently, resolve these issues before placing it in a high availability environment.

    Beyond basic stability, an application under HACMP should meet other robustness characteristics, such as those described in the following sections, Successful Start after Hardware Failure and Survival of Real Memory Loss.

    Successful Start after Hardware Failure

    A good application candidate for HACMP should be able to restart successfully after a hardware failure. Run a test on an application before putting it under HACMP. Run the application under a heavy load and fail the node. What does it take to recover once the node is back online? Can this recovery be completely automated? If not, the application may not be a good candidate for high availability.

    Survival of Real Memory Loss

    For an application to function well under HACMP, it should be able to survive a loss of the contents of real memory. It should be able to survive the loss of the kernel or processor state, both of which are lost when a node fails. Applications should also regularly checkpoint their data to disk, so that after a failure the application can pick up from the last checkpoint rather than starting completely over.

    Application Implementation Strategies

    There are a number of aspects of an application to consider as you plan for implementing it under HACMP. Consider characteristics such as time to start, time to restart after failure, and time to stop. Your decisions in a number of areas, including those discussed in this section—script writing, file storage, /etc/inittab file and cron schedule issues—can improve the probability of successful application implementation.

    Writing Effective Scripts

    Writing smart application start scripts can also help reduce the likelihood of problems when bringing applications online.

    A good practice for start scripts is to check prerequisite conditions before starting an application. These may include access to a filesystem, adequate paging space and free filesystem space. The start script should exit and run a command to notify system administrators if the requirements are not met.
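
    The following fragment is a sketch of such prerequisite checks; the filesystem name, free-space threshold, and notification address are examples only.

    #!/bin/ksh
    # Prerequisite checks at the top of an application start script (sketch only).

    # The shared filesystem must be mounted before the application starts.
    if ! mount | grep -q " /apps/myapp " ; then
        echo "/apps/myapp is not mounted" | mail -s "Application start failed" sysadmin
        exit 1
    fi

    # Require at least 100 MB free in the application filesystem
    # (on AIX, the Free column of df -m output is the third field).
    FREE=$(df -m /apps/myapp | awk 'NR==2 {print $3}')
    if [ "$FREE" -lt 100 ] ; then
        echo "Less than 100 MB free in /apps/myapp" | mail -s "Application start failed" sysadmin
        exit 1
    fi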

    When starting a database it is important to consider whether there are multiple instances within the same cluster. If this is the case, start only the instances applicable for each node. Certain database startup commands read a configuration file and start all known databases at the same time. This may not be a desired configuration for all environments.

    Warning: Be careful not to kill any HACMP processes as part of your script. If you are using the output of the ps command and using a grep to search for a certain pattern, make sure the pattern does not match any of the HACMP or RSCT processes.
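
    For instance, when a stop script cleans up application processes, constrain the grep pattern so it can only match the application itself; the process name below is an example.

    # Risky: a short pattern such as "grep cl" could also match HACMP or RSCT
    # daemons (for example, clstrmgrES or rmcd).
    # Safer: match only the full application command name.
    PIDS=$(ps -ef | grep "[m]yapp_server" | awk '{print $2}')
    if [ -n "$PIDS" ] ; then
        kill $PIDS
    fi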

    Considering File Storage Locations

    Give thought to where the configuration files reside. They could either be on shared disk, and therefore potentially accessed by whichever node has the volume group varied on, or on each node’s internal disks. This holds true for all aspects of an application. Certain files must be on shared drives. These include data, logs, and anything that could be updated by the execution of the application. Files such as configuration files or application binaries could reside in either location.

    There are advantages and disadvantages to storing optional files in either location. Having files stored on each node’s internal disks implies that you have multiple copies of, and potentially multiple licenses for, the application. This could require additional cost as well as maintenance in keeping these files synchronized. However, in the event that an application needs to be upgraded, the entire cluster need not be taken out of production. One node could be upgraded while the other remains in production. The “best” solution is the one that works best for a particular environment.

    Considering /etc/inittab and cron Table Issues

    Also give thought to applications, or resources needed by an application, that either start out of the /etc/inittab file or out of the cron table. The inittab starts applications upon boot up of the system. If cluster resources are needed for an application to function, they will not become available until after HACMP is started. It is better to use the HACMP application server facility that allows the application to be a resource that is started only after all dependent resources are online.

    Note: It is very important that the following be correct in /etc/inittab:

    hacmp:2:once:/usr/es/sbin/cluster/etc/rc.init "a"

  • The clinit and pst_clinit entries must be the last entries of run level “2”.
  • The clinit entry must be before the pst_clinit entry.
  • Any tty entries must not be “on” or “respawn”.

    An incorrect entry for these prevents HACMP from starting.

    In the cron table, jobs are started according to a schedule set in the table and the date setting on a node. This information is maintained on internal disks and thus cannot be shared by a standby node. Synchronize these cron tables so that a standby node can perform the necessary action at the appropriate time. Also, ensure the date is set the same on the primary node and any of its standby nodes.

    Examples: Oracle Database™ and SAP R/3™

    Here are two examples illustrating issues to consider in order to make the applications Oracle Database and SAP R/3 function well under HACMP.

    Example 1: Oracle Database

    The Oracle Database, like many databases, functions very well under HACMP. It is a robust application that handles failures well. It can roll back uncommitted transactions after a fallover and return to service in a timely manner. However, there are a few things to keep in mind when using Oracle Database under HACMP.

    Starting Oracle

    Oracle must be started by the Oracle user ID. Thus, the start script should contain an su - oracleuser. The dash (-) is important since the su needs to take on all characteristics of the Oracle user and reside in the Oracle user’s home directory. The command would look something like this:

    su - oracleuser -c "/apps/oracle/startup/dbstart"

    Commands like dbstart and dbshut read the /etc/oratab file for instructions on which database instances are known and should be started. In certain cases it is inappropriate to start all of the instances, because they may be owned by another node. This would be the case in the mutual takeover of two Oracle instances. The oratab file typically resides on the internal disk and thus cannot be shared. If appropriate, consider other ways of starting different Oracle instances.
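
    For example, instead of calling dbstart, a start script could start only the instance that belongs to the local resource group. In the following sketch the SID (PROD1) is an example, and the exact startup command depends on the Oracle release in use.

    # Start only the instance owned by the local resource group (sketch only).
    su - oracleuser -c "ORACLE_SID=PROD1; export ORACLE_SID; echo startup | sqlplus -s '/ as sysdba'"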

    Stopping Oracle

    The stopping of Oracle is a process of special interest. There are several different ways to ensure Oracle has completely stopped. The suggested sequence is this: first, implement a graceful shutdown; second, call a shutdown immediate, which is a more forceful method; finally, create a loop to check the process table to ensure all Oracle processes have exited.
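
    A sketch of such a stop sequence follows. The user, the ORACLE_SID (PROD1), and the timings are examples, and the exact shutdown commands depend on the Oracle release; in practice a hung graceful shutdown may also have to be cancelled, or a shutdown abort used, before the instance fully stops.

    #!/bin/ksh
    # Phased Oracle shutdown (sketch only).

    # Phase 1: graceful shutdown, run in the background because it waits for
    # user sessions to end and may not return for a long time.
    su - oracleuser -c "echo shutdown | sqlplus -s '/ as sysdba'" &
    sleep 120

    # Phase 2: if the instance is still up, request a shutdown immediate.
    if ps -ef | grep "[o]ra_pmon_PROD1" > /dev/null 2>&1 ; then
        su - oracleuser -c "echo 'shutdown immediate' | sqlplus -s '/ as sysdba'"
    fi

    # Phase 3: loop until all background processes for this instance exit.
    while ps -ef | grep "[o]ra_.*_PROD1" > /dev/null 2>&1
    do
        sleep 5
    done

    exit 0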

    Oracle File Storage

    The Oracle product database contains several files as well as data. It is necessary that the data and redo logs be stored on shared disk so that both nodes may have access to the information. However, the Oracle binaries and configuration files could reside on either internal or shared disks. Consider what solution is best for your environment.

    Example 2: SAP R/3, a Multi-Tiered Application

    SAP R/3 is an example of a three-tiered application. It has a database tier, an application tier, and a client tier. Most frequently, it is the database tier that is made highly available. In such a case, when a fallover occurs and the database is restarted, it is necessary to stop and restart the SAP application tier. You can do this in one of two ways:

  • Using a remote execution command such as rsh, rexec, or ssh

    Note: Certain methods, such as the use of ~/.rhosts files, pose a security risk.

  • Making the application tier nodes “cluster aware.”

    Using a Remote Execution Command

    The first way to stop and start the SAP application tier is to create a script that performs remote command execution on the application nodes. The application tier of SAP is stopped and then restarted. This is done for every node in the application tier. Using a remote execution command requires a method of allowing the database node access to the application node.

    Note: Certain methods, such as the use of ~/.rhosts files, pose a security risk.
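
    The following sketch illustrates the idea. The node names, the prdadm user, and the stopsap/startsap invocations are examples that depend on the SAP installation, and ssh access from the database node to the application nodes is assumed.

    #!/bin/ksh
    # Restart the SAP application tier from the database node (sketch only).
    APP_NODES="sapapp1 sapapp2"

    # Stop the application tier on every application node.
    for NODE in $APP_NODES
    do
        ssh "$NODE" "su - prdadm -c 'stopsap r3'"
    done

    # ... the database is failed over and restarted at this point ...

    # Restart the application tier once the database is available again.
    for NODE in $APP_NODES
    do
        ssh "$NODE" "su - prdadm -c 'startsap r3'"
    done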

    Making Application Tier Nodes “Cluster Aware”

    A second method for stopping and starting the application tier is to make the application tier nodes “cluster aware.” This means that the application tier nodes are aware of the clustered database and know when a fallover occurs. You can implement this by making the application tier nodes either HACMP servers or clients. If the application node is a server, it runs the same cluster events as the database nodes to indicate a failure; pre- and post-event scripts could then be written to stop and restart the SAP application tier. If the application node is an HACMP client, it is notified of the database fallover via SNMP through the cluster information daemon (clinfo). A program could be written using the Clinfo API to stop and restart the SAP application tier.

    Consult the manual Programming Client Applications for more detail on the Clinfo API.

