Recovery requirements in a transaction processing system

An online system requires mechanisms that, together with suitable operating procedures, provide automatic recovery from failures and allow the system to restart with the minimum of disruption.

The two main recovery requirements of an online system are:

Maintaining the integrity of data

"Data integrity" means that the data is in the form you expect and has not been corrupted. The objective of recovery operations on files, databases, and similar data resources is to maintain and restore the integrity of the information. Recovery must also ensure that related changes remain consistent: either all of them are made, or none of them are. (In this book, the term resources refers to data resources unless stated otherwise.)

Logging changes

One way of maintaining the integrity of a resource is to keep a record, or log, of all the changes made to a resource while the system is executing normally. If a failure occurs, the logged information can help recover the data.

An online system can use the logged information in two ways:

  1. It can be used to back out incomplete or invalid changes to one or more resources. This is called backward recovery, or backout. For backout, it is necessary to record the contents of a data element before it is changed. These records are called before-images. In general, backout is applicable to processing failures that prevent one or more transactions (or a batch program) from completing.
  2. It can be used to reconstruct changes to a resource, starting with a backup copy of the resource taken earlier. This is called forward recovery. For forward recovery, it is necessary to record the contents of a data element after it is changed. These records are called after-images. In general, forward recovery is applicable to data set failures, or failures in similar data resources, which cause data to become unusable because it has been corrupted or because the physical storage medium has been damaged.
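The two uses of logged information can be illustrated with a minimal sketch. It assumes a simple in-memory key-value store. The names LoggedStore, LogRecord, and forward_recover are hypothetical and do not correspond to any product interface. The sketch also assumes that forward recovery replays every logged after-image, so it does not account for transactions that were backed out after the backup was taken.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogRecord:
    txn_id: str
    key: str
    before: Optional[str]  # before-image (None if the key did not exist)
    after: Optional[str]   # after-image (None if the key was deleted)


def _apply(data: Dict[str, str], key: str, image: Optional[str]) -> None:
    """Set a key to the given image, or remove it if the image is None."""
    if image is None:
        data.pop(key, None)
    else:
        data[key] = image


class LoggedStore:
    """Hypothetical store that logs both images before changing the data."""

    def __init__(self, data: Optional[Dict[str, str]] = None) -> None:
        self.data: Dict[str, str] = dict(data or {})
        self.log: List[LogRecord] = []

    def write(self, txn_id: str, key: str, value: Optional[str]) -> None:
        # Record the before-image and after-image, then make the change.
        self.log.append(LogRecord(txn_id, key, self.data.get(key), value))
        _apply(self.data, key, value)

    def backout(self, txn_id: str) -> None:
        """Backward recovery: restore before-images, newest change first."""
        for rec in reversed(self.log):
            if rec.txn_id == txn_id:
                _apply(self.data, rec.key, rec.before)


def forward_recover(backup: Dict[str, str],
                    log: List[LogRecord]) -> Dict[str, str]:
    """Forward recovery: reapply after-images, oldest first, to a backup copy."""
    data = dict(backup)
    for rec in log:
        _apply(data, rec.key, rec.after)
    return data
```

Backout walks the log in reverse so that, when one transaction changes the same element more than once, the oldest before-image is the one left in place. Forward recovery walks the log in the original order for the corresponding reason.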

Minimizing the effect of failures

An online system should limit the effect of any failure. Where possible, a failure that affects only one user, one application, or one data set should not halt the entire system. Furthermore, if processing for one user is forced to stop prematurely, it should be possible to back out any changes made to any data sets as if the processing had not started.

If processing for the entire system stops, there may be many users whose updating work is interrupted. On a subsequent startup of the system, only those data set updates that were in process (in-flight) at the time of failure should be backed out. Backing out only the in-flight updates makes restart quicker and reduces the amount of data that users must reenter.
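The restart behavior described above can be sketched as follows. The log layout is an assumption made for illustration: each entry is either ("update", txn_id, key, before_image) or ("commit", txn_id), and a transaction with no commit entry is treated as in-flight. The function name backout_in_flight is hypothetical. The sketch also assumes that no committed transaction changed an element after an in-flight transaction had changed it, which a real system would typically ensure through locking.

```python
from typing import Dict, List, Set, Tuple


def backout_in_flight(data: Dict[str, str], log: List[Tuple]) -> Set[str]:
    """Back out updates of transactions that never logged a commit.

    Modifies data in place and returns the set of in-flight transaction ids.
    """
    # Transactions that completed before the failure keep their updates.
    committed = {entry[1] for entry in log if entry[0] == "commit"}

    in_flight: Set[str] = set()
    # Scan newest first so the oldest before-image is restored last.
    for entry in reversed(log):
        if entry[0] == "update" and entry[1] not in committed:
            _, txn_id, key, before = entry
            in_flight.add(txn_id)
            if before is None:
                data.pop(key, None)   # element did not exist before the update
            else:
                data[key] = before    # restore the before-image
    return in_flight
```

Because committed transactions are skipped, their updates survive the restart, and only the users whose work was in-flight need to reenter data.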

Ideally, it should be possible to restore the data to a consistent, known state following any type of failure, with minimal loss of valid updating activity.
