Disaster recovery and high availability

This topic describes the tier 6 solutions for high availability and data currency when recovering from a disaster.

Peer-to-peer remote copy (PPRC) and extended remote copy (XRC)

PPRC and XRC are both 3990-6 hardware solutions that provide data currency to secondary, remote volumes. Updates made to secondary DASD are kept in time sequence, which ensures that updates are applied consistently across volumes. PPRC and XRC also keep updates in time sequence across control units. This sequencing offers a very high degree of data integrity across volumes behind different control units.

Because PPRC and XRC are hardware solutions, they are independent of both the application and the data. The data can be DB2®, VSAM, IMS™, or any other type of data. All your vital data on DASD can be duplicated off-site. This reduces the complexity of recovery. These solutions can also make use of redundant array of independent disks (RAID) DASD to deliver the highest levels of data integrity and availability.

PPRC synchronously shadows updates from the primary to the secondary site. This ensures that data committed on the primary DASD is also present on the secondary DASD, so no committed data is lost. The time taken for the synchronous write to the secondary unit has an impact on your application, increasing response time. This additional time (required for each write operation) is approximately equivalent to a DASD fast write operation. Because the implementation of PPRC is almost entirely in the 3990-6, you must provide enough capacity for cache and non-volatile storage (NVS) to ensure optimum performance.

XRC is an asynchronous implementation of remote copy. The application updates the primary data as usual, and XRC then passes the updates to the secondary site. The currency of the secondary site lags slightly behind the primary site because of updates in transit. As part of XRC data management, updates to the secondary site are performed in the same sequence as at the primary site. This ensures data integrity across controllers and devices. Because XRC does not wait for updates to be made at the secondary site, the application’s performance is not directly affected. XRC uses cache and non-volatile storage, so you must provide enough capacity to ensure optimum performance.
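The difference between the two methods can be illustrated with a small simulation. The following Python sketch is a toy model only, not a representation of how the 3990-6 implements remote copy. The class names SyncPair and AsyncPair and the drain method are hypothetical, introduced purely for illustration.

```python
from collections import deque


class SyncPair:
    """Toy model of synchronous shadowing (PPRC-like behaviour).

    A write is reported complete only after both the primary and the
    secondary copy hold the update.
    """

    def __init__(self):
        self.primary = []
        self.secondary = []

    def write(self, update):
        self.primary.append(update)
        # The application waits for this step, which adds response time.
        self.secondary.append(update)
        return "complete"


class AsyncPair:
    """Toy model of asynchronous shadowing (XRC-like behaviour).

    A write is reported complete once the primary holds the update.
    Updates travel to the secondary later, in their original sequence.
    """

    def __init__(self):
        self.primary = []
        self.secondary = []
        self.in_transit = deque()  # updates not yet applied remotely

    def write(self, update):
        self.primary.append(update)
        self.in_transit.append(update)
        return "complete"

    def drain(self, count):
        """Apply up to `count` in-transit updates, oldest first."""
        for _ in range(min(count, len(self.in_transit))):
            self.secondary.append(self.in_transit.popleft())


def demo():
    sync_pair, async_pair = SyncPair(), AsyncPair()
    for update in ["u1", "u2", "u3"]:
        sync_pair.write(update)
        async_pair.write(update)
    async_pair.drain(2)  # only two updates have reached the remote site
    return sync_pair.secondary, async_pair.secondary, list(async_pair.in_transit)
```

In this model the asynchronous secondary always holds a prefix of the primary's update sequence. It lags behind the primary, but it never holds a later update without the earlier ones.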

In the event of a disaster, check the state of all secondary volumes to ensure data consistency against the shadowed log data sets. This ensures that the same sequence of updates is maintained on the secondary volumes as on the primary volumes up to the point of the disaster. Because PPRC and XRC do not require restores or forward recovery of data, your restart procedures on the secondary system may be the same as for a short-term outage at the primary site, such as a power outage.

When running with PPRC or XRC, the data you replicate along with the databases includes:

CICS applications can use non-DASD storage for processing data. If your application depends on this type of data, be aware that PPRC and XRC do not handle it.

For more information on PPRC and XRC, see Planning for IBM® Remote Copy, SG24-2595-00, and DFSMS/MVS Remote Copy Administrator’s Guide and Reference.

PPRC or XRC?

You need to choose between PPRC and XRC for transmitting data to your backup site. This topic compares the two methods to help you make your choice.

Choose PPRC as your remote copy facility if you:

The synchronous nature of PPRC ensures that, if you have a disaster at your main site, you lose only inflight transactions. The committed data recorded at the remote site is the same as that at the primary site.

Use PPRC for high value transactions

Consider PPRC if you deal with high value transactions, and data integrity in a disaster is more important to you than day-to-day performance. PPRC is more likely to be the solution for you if you characterize your business as being low volume, high value transactions; for example, a system supporting payments of thousands, or even millions, of dollars.

Choose XRC as your remote copy facility if you:

The asynchronous nature of XRC means that the remote site may have no knowledge of transactions that ran at the primary site, or may not know that they completed successfully. XRC ensures that the data recorded at the remote site is consistent (that is, it looks like a snapshot of the data at the primary site, but the snapshot may be several seconds old).

Use XRC for high volume transactions

Consider XRC if you deal with low value transactions, and data integrity in a disaster is less important to you than day-to-day performance. XRC is more likely to be the solution for you if you characterize your business as being high volume, low value transactions; for example, a system supporting a network of ATMs, where there is a high volume of transactions, but each transaction is typically less than 200 dollars in value.

Other benefits of PPRC and XRC

PPRC or XRC may eliminate the need for disaster recovery backups to be taken at the primary site, or to be taken at all. PPRC allows you to temporarily suspend the copying of updates to the secondary site so that you can make image copies or backups of the data there. After the backups are complete, you can reestablish the pairing of data sets on the primary and secondary sites. Updates to the primary that have been recorded by the 3990-6 are applied to the secondary to resynchronize the pair.
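The suspend, back up, and resynchronize cycle can be sketched as follows. This Python example is a conceptual model only. The VolumePair class, its method names, and the assumption that changes are recorded per track are all hypothetical and are not taken from the 3990-6 documentation.

```python
class VolumePair:
    """Toy model of a remote copy pair that can be suspended and resynchronized."""

    def __init__(self):
        self.primary = {}      # track number -> data
        self.secondary = {}
        self.suspended = False
        self.changed = set()   # tracks updated on the primary while suspended

    def write(self, track, data):
        self.primary[track] = data
        if self.suspended:
            # Assumption: only the identity of the changed track is recorded.
            self.changed.add(track)
        else:
            self.secondary[track] = data

    def suspend(self):
        self.suspended = True

    def resync(self):
        """Copy every track changed during suspension, then resume pairing."""
        for track in sorted(self.changed):
            self.secondary[track] = self.primary[track]
        self.changed.clear()
        self.suspended = False


def demo():
    pair = VolumePair()
    pair.write(1, "a")
    pair.suspend()
    backup = dict(pair.secondary)  # stable copy taken at the secondary site
    pair.write(1, "b")             # primary keeps running during the backup
    pair.write(2, "c")
    pair.resync()
    return backup, pair.primary, pair.secondary
```

The backup taken during suspension reflects the secondary as it was when copying stopped, and after resynchronization the secondary matches the primary again.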

XRC supports the running of concurrent copy sessions to its secondary volumes. This enables you to create a point-in-time copy of the data.

PPRC and XRC also allow you to migrate data to another or larger DASD of similar geometry, behind the same or different control units at the secondary site. This can be done for workload management or DASD maintenance, for example.

Forward recovery

Whether you use PPRC or XRC, you have two fundamental choices:

  1. You can pair volumes containing both the data and the log records
  2. You can pair only the volumes containing the log records

In the first case you should be able to perform an emergency restart of your systems, and restore service very rapidly. In the latter case you would need to use the log records, along with an image copy transmitted separately, to perform a forward recovery of your data, followed by an emergency restart.
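The forward recovery needed in the second case can be pictured as replaying log records on top of an image copy. The following Python sketch is a simplified illustration and does not represent a CICS, DB2, or IMS recovery utility. The function name forward_recover, the sequence numbers, and the record layout are hypothetical.

```python
def forward_recover(image_copy, log_records):
    """Rebuild data from an image copy plus later log records.

    image_copy:  (copy_seq, data) where data maps key -> value and
                 copy_seq is the last log sequence number the copy reflects.
    log_records: iterable of (seq, key, value) tuples.
    """
    copy_seq, data = image_copy
    recovered = dict(data)  # leave the image copy itself untouched
    for seq, key, value in sorted(log_records, key=lambda record: record[0]):
        if seq > copy_seq:  # skip updates already contained in the copy
            recovered[key] = value
    return recovered


def demo():
    image_copy = (2, {"acct1": 100, "acct2": 50})
    log_records = [
        (1, "acct1", 100),
        (2, "acct2", 50),
        (3, "acct1", 80),
        (4, "acct3", 10),
    ]
    return forward_recover(image_copy, log_records)
```

The more log records there are after the image copy, the longer this replay takes, which is why pairing only the log volumes generally leads to a slower restart than pairing the data volumes as well.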

Pairing the data volumes, as well as the log volumes, costs more because you have more data flowing between the sites, and therefore you need a greater bandwidth to support the flow. In return, you can in theory restart much faster than if you have to perform a forward recovery. When deciding which to use, you must determine whether pairing the data volumes makes restart significantly faster, and whether you think it is worth the additional costs.

Remote Recovery Data Facility

The Remote Recovery Data Facility (RRDF), Version 2 Release 1, a product of the E-Net Corporation, minimizes data loss and service outage time in the event of a disaster by providing a real-time remote logging function. Real-time remote logging provides data currency at the remote site, enabling the remote site to recover databases to within seconds of the outage, typically less than 1 second.

RRDF runs in its own address space. It provides programs that run in the CICS or database management system address space. These programs are invoked through standard interfaces--for example, at exit points associated with writing out log records.

The programs that run in the CICS or database management system address space use MVS™ cross-memory services to move log data to the RRDF address space. The RRDF address space maintains a virtual storage queue at the primary site for records awaiting transmission, with provision for spill files if communication between the primary and secondary sites is interrupted. Remote logging is only as effective as the currency of the data that is sent off-site. RRDF transports log stream data to a remote location in real-time, within seconds of the log operation at the primary site.
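The queue-and-spill behaviour described above can be modelled in a few lines. The following Python sketch is a toy illustration and makes no claim about how RRDF is implemented. The RemoteLogger class, the queue_limit parameter, and the use of an in-memory list in place of a spill file are assumptions made for the example.

```python
from collections import deque


class RemoteLogger:
    """Toy model of a transmit queue with spill storage for link outages."""

    def __init__(self, queue_limit=3):
        self.queue = deque()   # stands in for the virtual storage queue
        self.spill = []        # stands in for a spill file
        self.queue_limit = queue_limit
        self.link_up = True
        self.remote = []       # records received at the remote site

    def log(self, record):
        # Once spilling has started, keep spilling so that order is preserved.
        if self.spill or len(self.queue) >= self.queue_limit:
            self.spill.append(record)
        else:
            self.queue.append(record)
        if self.link_up:
            self.transmit()

    def transmit(self):
        """Send queued records first, then spilled records, oldest first."""
        while self.queue:
            self.remote.append(self.queue.popleft())
        self.remote.extend(self.spill)
        self.spill.clear()

    def restore_link(self):
        self.link_up = True
        self.transmit()


def demo():
    logger = RemoteLogger(queue_limit=2)
    logger.link_up = False              # communication is interrupted
    for record in ["r1", "r2", "r3", "r4"]:
        logger.log(record)
    spilled = list(logger.spill)
    logger.restore_link()
    return spilled, logger.remote
```

In this model no record is lost during the outage, and the remote site receives the records in the order in which they were logged at the primary site.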

When the RRDF address space at the remote site receives the log data, it formats it into archived log data sets. Once data has been stored at the remote site, you can use it as needed to meet business requirements. The recovery process uses standard recovery utilities. For most data formats, first use the log data transmitted by RRDF in conjunction with a recent image copy of the data sets and databases that you have to recover. Then perform a forward recovery. If you are using DB2, you have the option of applying log records to the remote copies of your databases as RRDF receives the log records.

If you use DB2, you can use the optional RRDF log apply feature. With this feature you can maintain a logically consistent "shadow" copy of a DB2 database at the remote site. The RRDF log apply feature updates the shadow database at selected intervals, using log data transmitted from the primary site. Thus restart time is shortened because the work needed after a disaster is minimal. The currency of the data depends on the log data transmitted by RRDF and on how frequently you run the RRDF log apply feature. The RRDF log apply feature also enhances data availability, as you have read access to the shadow copy through a remote site DB2 subsystem. RRDF supports DB2 remote logging for all environments, including TSO, IMS, CICS, batch, and call attach.

At least two RRDF licenses are required to support the remote site recovery function, one for the primary site and one for the remote site. For details of RRDF support needed for the CICS Transaction Server, see Remote Recovery Data Facility support.

Choosing between RRDF and 3990-6 solutions

Table 4 summarizes the characteristics of the products you can use to implement a tier 6 solution. You must decide which solution or solutions are most appropriate for your environment.

Table 4. Selecting a tier 6 implementation. This table compares the strengths of the tier 6 solutions.

                            RRDF                                        3990-6
Data type supported         Various data sets (see note 1)              Any on DASD
Database shadowing          Optional. Available for DB2 and IDMS only   Optional
Forward recovery required   Yes                                         Depends on implementation
Distance limitation         None                                        About 40 km for ESCON. Unlimited for XRC with channel extenders

Note:
1. Data sets managed by CICS file control and the DB2, IMS, IDMS, CPCS, ADABAS, and SuperMICR database management systems.

Disaster recovery personnel considerations

When planning for disaster recovery, you need to consider personnel issues.

You should ensure that a senior manager is designated as the disaster recovery manager. The recovery manager must make the final decision whether to switch to a remote site, or to try to rebuild the local system (this is especially true if you have adopted a solution that does not have a warm or hot standby site).

You must decide who will run the remote site, especially during the early hours of the disaster. If your recovery site is a long way from the primary site, many of your staff will be in the wrong place.

Finally, and to show the seriousness of disaster recovery, it is possible that some of your key staff may be severely injured and unable to take part in the recovery operation. Your plans need to identify backups for all your key staff.

Returning to your primary site

One aspect of disaster recovery planning which can be overlooked is the need to include plans for returning operations from the recovery site back to the primary site (or to a new primary site if the original primary site cannot be used again). Build the return to normal operations into your plan. The worst possible time to create a plan for moving back to your primary site is after a disaster. You will probably be far too busy to spend time building a plan. As a result, the return to your primary site may be delayed, and may not work properly.

Disaster recovery further information

If you require more information regarding the recovery of specific types of data, see the following publications:

See also the ITSC Disaster Recovery Library: Planning Guide for information that should help you set up a detailed disaster recovery plan if you use a combination of databases, such as DB2 and IMS.
